[00:22:07] (03Abandoned) 10Krinkle: Make Wikipedia link on 404 page language-agnostic via Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346264 (https://phabricator.wikimedia.org/T113114) (owner: 10Nemo bis) [00:22:17] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 21 probes of 303 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [00:27:17] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 9 probes of 303 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [01:01:14] (03PS3) 10Krinkle: Update comment avconv -> ffmpeg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403470 (owner: 10Brion VIBBER) [01:01:17] (03CR) 10Krinkle: [C: 032] Update comment avconv -> ffmpeg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403470 (owner: 10Brion VIBBER) [01:01:57] brion: just gonna roll out that comment change for ya [01:02:56] (03Merged) 10jenkins-bot: Update comment avconv -> ffmpeg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403470 (owner: 10Brion VIBBER) [01:03:12] (03CR) 10jenkins-bot: Update comment avconv -> ffmpeg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403470 (owner: 10Brion VIBBER) [01:09:26] !log krinkle@tin Synchronized wmf-config/CommonSettings.php: no-op Ib39022b2c37b033d (duration: 01m 00s) [01:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:25:37] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/title/{title}{/revision} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia [02:25:37] check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) is CRITICAL: Test Get media in test p [02:25:37] expected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-sections/{title}{/revision} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a t [02:25:47] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:25:47] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:25:57] PROBLEM - Check systemd state on restbase-dev1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
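The "Check systemd state ... degraded" alerts above mean at least one systemd unit on the restbase-dev hosts has entered the failed state, matching the 500s from the endpoint checks. A minimal triage sketch in shell, assuming the failing service is the restbase unit and that the hosts resolve under .eqiad.wmnet:
```
# Log in to one of the affected hosts (FQDN assumed)
ssh restbase-dev1005.eqiad.wmnet

# "degraded" simply means one or more units are failed; list them
systemctl list-units --state=failed

# Inspect the suspect unit and its recent logs (unit name assumed)
systemctl status restbase
journalctl -u restbase -n 100 --no-pager
```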
[02:25:58] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/title/{title}{/revision} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia [02:25:58] check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) is CRITICAL: Test Get media in test p [02:25:58] expected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-sections/{title}{/revision} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a t [02:26:37] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/title/{title}{/revision} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia [02:26:37] check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) is CRITICAL: Test Get media in test p [02:26:37] expected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-sections/{title}{/revision} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a t [02:32:32] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.29) (duration: 05m 27s) [02:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:55] (03PS1) 10Marostegui: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427050 [05:07:39] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427050 (owner: 10Marostegui) [05:08:56] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427050 (owner: 10Marostegui) [05:09:10] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427050 (owner: 10Marostegui) [05:10:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1114 (duration: 00m 59s) [05:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:12] (03PS1) 10Marostegui: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427051 (https://phabricator.wikimedia.org/T187089) [05:11:50] !log Stop MySQL and reboot db1114 to boot up with the new kernel [05:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:12] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Depool db1092 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/427051 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [05:14:02] (03PS2) 10Marostegui: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427051 (https://phabricator.wikimedia.org/T187089) [05:16:34] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427051 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [05:17:46] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427051 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [05:19:01] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1092 for alter table (duration: 00m 58s) [05:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:28] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427052 [05:20:30] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4135199 (10Marostegui) There is definitely an impact on which kernel we are running. After running: ``` root@db1114:~# uname -a Linux db1114 4.9.0-4-amd64 #1 SMP Debian 4.9... [05:20:47] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427051 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [05:21:19] !log Deploy schema change on db1092 - T187089 T185128 T153182 [05:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:26] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [05:21:26] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [05:21:26] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [05:21:34] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427052 (owner: 10Marostegui) [05:22:42] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427052 (owner: 10Marostegui) [05:24:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1114 (duration: 00m 58s) [05:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:48] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427052 (owner: 10Marostegui) [05:27:16] (03PS1) 10Marostegui: db-eqiad.php: Restore db1114 main traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427053 [05:32:46] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1114 main traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427053 (owner: 10Marostegui) [05:33:57] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1114 main traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427053 (owner: 10Marostegui) [05:35:11] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 51981.2173076923 = 15000.0 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:35:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore main traffic original weight for db1114 (duration: 00m 58s) [05:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:20] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:36:22] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427054 [05:39:30] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1114 main traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427053 (owner: 10Marostegui) [05:41:10] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427054 (owner: 10Marostegui) [05:42:22] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427054 (owner: 10Marostegui) [05:43:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1114 in API (duration: 00m 58s) [05:43:35] the kubelet operational latencies alerts are me btw [05:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:38] (03PS1) 10Marostegui: db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427055 [05:45:09] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427054 (owner: 10Marostegui) [05:45:20] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 221217.5021459228 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:45:46] <_joe_> akosiaris: ah ok [05:46:00] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))): 137424.89871086556 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId= [05:46:20] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))): 90640.30864197531 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:47:00] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, 
instance=kubernetes1002.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:47:20] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:47:20] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))): 1441168.0829596412 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId= [05:47:20] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:48:20] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:50:23] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427055 (owner: 10Marostegui) [05:51:40] (03Merged) 10jenkins-bot: db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427055 (owner: 10Marostegui) [05:51:54] (03CR) 10jenkins-bot: db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427055 (owner: 10Marostegui) [05:52:53] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give more API traffic to db1114 (duration: 00m 58s) [05:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:42] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427057 [05:56:20] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 126656.25338894683 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId= [05:57:20] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:01:36] 
(03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427057 (owner: 10Marostegui) [06:02:34] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427057 (owner: 10Marostegui) [06:03:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1114 in API - T191996 (duration: 00m 58s) [06:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:01] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [06:04:34] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427057 (owner: 10Marostegui) [06:18:21] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))): 66785.25304878049 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:19:21] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:29:41] PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] [06:34:41] RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [06:35:50] (03PS1) 10Vgutierrez: install_server: Reimage chromium as stretch [puppet] - 10https://gerrit.wikimedia.org/r/427061 (https://phabricator.wikimedia.org/T187090) [06:35:52] (03PS1) 10Vgutierrez: Remove chromium from eqiad LVS name server config [puppet] - 10https://gerrit.wikimedia.org/r/427062 (https://phabricator.wikimedia.org/T187090) [06:36:31] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [06:40:11] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))): 51376.75193133048 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:40:52] (03CR) 10Vgutierrez: [C: 032] install_server: Reimage chromium as stretch [puppet] - 10https://gerrit.wikimedia.org/r/427061 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez) [06:41:18] (03CR) 10Vgutierrez: [C: 032] Remove chromium from eqiad LVS name server config [puppet] - 10https://gerrit.wikimedia.org/r/427062 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez) [06:41:20] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, 
instance=kubernetes1002.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:47:45] !log Depool and reimage chromium as stretch - T187090 [06:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:53] T187090: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090 [06:48:18] !log vgutierrez@neodymium conftool action : set/pooled=no; selector: name=chromium.wikimedia.org,service=pdns_recursor [06:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:09] 10Operations, 10Traffic, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4135292 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` chromium.wikimedia.org ``` The log can be found in `/var/log/wmf-aut... [07:09:50] RECOVERY - puppet last run on labtestnet2001 is OK: OK: Puppet is currently enabled, last run 8 minutes ago with 0 failures [07:14:20] (03PS1) 10Elukey: profile::prometheus::alerts: set correct prometheus url for kafka mm [puppet] - 10https://gerrit.wikimedia.org/r/427065 [07:14:34] !log installing perl security updates on trusty [07:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:11] PROBLEM - MariaDB Slave IO: s6 on dbstore2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [07:17:11] PROBLEM - MariaDB Slave IO: s2 on dbstore2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [07:17:32] PROBLEM - MariaDB Slave IO: s5 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:17:32] PROBLEM - MariaDB Slave SQL: s7 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:17:32] PROBLEM - MariaDB Slave SQL: x1 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:17:41] checking [07:17:46] that should be me, but I don't know why [07:17:50] PROBLEM - MariaDB Slave IO: s7 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:17:50] PROBLEM - MariaDB Slave SQL: s6 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:17:50] PROBLEM - MariaDB Slave SQL: s5 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:17:51] PROBLEM - MariaDB Slave IO: x1 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:18:00] PROBLEM - MariaDB Slave SQL: s2 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:18:09] jynus: looks like OOM [07:18:13] ? [07:18:24] [Tue Apr 17 07:21:37 2018] Out of memory: Kill process 13564 (mysql) score 263 or sacrifice child [07:18:27] did they get killed? [07:18:40] all of them?
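A rough sketch of the checks behind the exchange above on dbstore2001: confirm the kernel OOM kill shown in the quoted dmesg line, and work out whether it hit the long-running mysql client or one of the mysqld server instances; the per-section socket path below is an assumption:
```
# Recent OOM-killer activity (the quoted "[Tue Apr 17 07:21:37 2018]" line comes from here)
dmesg -T | grep -iE 'out of memory|oom' | tail -n 20

# The Icinga check expects several mysqld server processes on this multi-instance host
pgrep -a mysqld

# Per-section replication status; the socket path for the s2 instance is an assumption
mysql --socket=/run/mysqld/mysqld.s2.sock -e 'SHOW SLAVE STATUS\G'
```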
[07:18:53] (03CR) 10Elukey: [C: 032] profile::prometheus::alerts: set correct prometheus url for kafka mm [puppet] - 10https://gerrit.wikimedia.org/r/427065 (owner: 10Elukey) [07:19:20] PROBLEM - Check size of conntrack table on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:19:21] PROBLEM - MariaDB Slave SQL: s8 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:19:21] PROBLEM - Disk space on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:19:30] PROBLEM - MariaDB disk space on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:19:30] PROBLEM - configured eth on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:19:41] PROBLEM - dhclient process on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:19:49] it seems like the host [07:19:50] PROBLEM - DPKG on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:19:51] PROBLEM - mysqld processes on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:19:51] PROBLEM - Check systemd state on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:20:01] PROBLEM - MariaDB Slave IO: s8 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:20:01] PROBLEM - Check whether ferm is active by checking the default input chain on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:20:20] PROBLEM - puppet last run on dbstore2001 is CRITICAL: Return code of 255 is out of bounds [07:20:31] mysql is ok, I think it is nagios [07:21:20] mysql, not mysqld got killed [07:21:30] (aka my query) [07:22:06] Ah cool! [07:22:15] replication is stopped though on a few threads [07:22:21] is that you? [07:22:28] yes [07:22:55] but something is not ok there [07:25:02] I am going to restart the whole thing [07:25:35] 10Operations, 10Traffic, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4135310 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['chromium.wikimedia.org'] ``` Of which those **FAILED**: ``` ['chromium.wikimedia.org'] ``` [07:27:02] RECOVERY - Check systemd state on dbstore2001 is OK: OK - running: The system is fully operational [07:27:02] RECOVERY - mysqld processes on dbstore2001 is OK: PROCS OK: 6 processes with command name mysqld [07:27:11] RECOVERY - MariaDB Slave IO: s8 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes [07:27:11] RECOVERY - Check whether ferm is active by checking the default input chain on dbstore2001 is OK: OK ferm input default policy is set [07:27:31] RECOVERY - Check size of conntrack table on dbstore2001 is OK: OK: nf_conntrack is 0 % full [07:27:31] !log restarting dbstore2001 [07:27:32] RECOVERY - MariaDB disk space on dbstore2001 is OK: DISK OK [07:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:41] RECOVERY - Disk space on dbstore2001 is OK: DISK OK [07:27:41] RECOVERY - configured eth on dbstore2001 is OK: OK - interfaces up [07:27:51] RECOVERY - dhclient process on dbstore2001 is OK: PROCS OK: 0 processes with command name dhclient [07:27:52] RECOVERY - DPKG on dbstore2001 is OK: All packages OK [07:30:22] RECOVERY - puppet last run on dbstore2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:34:12] (03PS5) 10Gehel: maps: icinga alert if tiles are not being generated [puppet] - 10https://gerrit.wikimedia.org/r/410136 (https://phabricator.wikimedia.org/T175243) [07:35:01] (03CR) 10Gehel: [C: 032] maps: icinga alert if tiles are 
not being generated [puppet] - 10https://gerrit.wikimedia.org/r/410136 (https://phabricator.wikimedia.org/T175243) (owner: 10Gehel) [07:42:02] !log installing ICU security updates [07:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:45] PROBLEM - MD RAID on chromium is CRITICAL: Return code of 255 is out of bounds [07:44:55] PROBLEM - Recursive DNS on 208.80.154.157 is CRITICAL: CRITICAL - Plugin timed out while executing system call [07:46:08] ^that's me reimaging chromium [07:46:25] PROBLEM - NTP peers on chromium is CRITICAL: NTP CRITICAL: No response from NTP server [07:47:08] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4135346 (10Marostegui) [07:52:26] RECOVERY - NTP peers on chromium is OK: NTP OK: Offset 0.006225 secs [07:52:46] RECOVERY - Recursive DNS on 208.80.154.157 is OK: DNS OK: 0.008 seconds response time. www.wikipedia.org returns 208.80.154.224 [07:53:25] RECOVERY - MariaDB Slave SQL: s8 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [07:53:35] (03PS1) 10Elukey: profile::kafka::mirror::alerts: introduce prometheus_url_lag_check [puppet] - 10https://gerrit.wikimedia.org/r/427069 [07:54:45] RECOVERY - MD RAID on chromium is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [07:54:45] PROBLEM - Host 2620:0:861:2:7a2b:cbff:fe08:aa48 is DOWN: PING CRITICAL - Packet loss = 100% [07:55:25] (03PS2) 10Elukey: profile::kafka::mirror::alerts: introduce prometheus_url_lag_check [puppet] - 10https://gerrit.wikimedia.org/r/427069 [07:56:22] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4034571 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1003.eqiad.wmnet']... 
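The chromium reimage above follows the usual depool → reimage → repool cycle. A sketch based on the confctl selector quoted in the !log entries, with a minimal wmf-auto-reimage invocation assumed (the actual run on neodymium may have used additional options):
```
# Depool the recursor before reimaging, and verify the object state
confctl select 'name=chromium.wikimedia.org,service=pdns_recursor' set/pooled=no
confctl select 'name=chromium.wikimedia.org,service=pdns_recursor' get

# Reimage; run from neodymium per the task comment (arguments beyond the hostname are assumed)
sudo -i wmf-auto-reimage chromium.wikimedia.org

# Repool once the host is healthy again
confctl select 'name=chromium.wikimedia.org,service=pdns_recursor' set/pooled=yes
```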
[07:56:28] (03CR) 10Elukey: [C: 032] profile::kafka::mirror::alerts: introduce prometheus_url_lag_check [puppet] - 10https://gerrit.wikimedia.org/r/427069 (owner: 10Elukey) [07:56:35] RECOVERY - MariaDB Slave IO: s2 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes [07:57:05] RECOVERY - MariaDB Slave SQL: s7 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [07:57:05] RECOVERY - MariaDB Slave IO: s5 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes [07:57:05] RECOVERY - MariaDB Slave SQL: x1 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [07:57:15] RECOVERY - MariaDB Slave IO: s7 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes [07:57:15] RECOVERY - MariaDB Slave SQL: s6 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [07:57:16] RECOVERY - MariaDB Slave SQL: s5 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [07:57:16] RECOVERY - MariaDB Slave IO: s6 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes [07:57:25] RECOVERY - MariaDB Slave IO: x1 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes [07:57:26] RECOVERY - MariaDB Slave SQL: s2 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [07:58:56] PROBLEM - Host 2620:0:861:2:7a2b:cbff:fe08:aa48 is DOWN: CRITICAL - Destination Unreachable (2620:0:861:2:7a2b:cbff:fe08:aa48) [08:01:17] !log rolling restart of HHVM on video scalers to pick up ICU security update [08:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:55] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw@0 on kafka2001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw@0/producer\.properties [08:19:25] !log restart nrpe-server on kafka2001 (kafka check not defined) [08:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:36] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4135384 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['wdqs1003.eqiad.wmnet'] ``` and were **ALL** successful. [08:30:59] (03PS1) 10Vgutierrez: Revert "Remove chromium from eqiad LVS name server config" chromium has been reimaged successfully. [puppet] - 10https://gerrit.wikimedia.org/r/427076 (https://phabricator.wikimedia.org/T187090) [08:31:24] (03CR) 10jerkins-bot: [V: 04-1] Revert "Remove chromium from eqiad LVS name server config" chromium has been reimaged successfully. [puppet] - 10https://gerrit.wikimedia.org/r/427076 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez) [08:32:53] (03PS2) 10Vgutierrez: Revert "Remove chromium from eqiad LVS name server config" chromium has been reimaged successfully. [puppet] - 10https://gerrit.wikimedia.org/r/427076 (https://phabricator.wikimedia.org/T187090) [08:33:17] (03CR) 10jerkins-bot: [V: 04-1] Revert "Remove chromium from eqiad LVS name server config" chromium has been reimaged successfully. [puppet] - 10https://gerrit.wikimedia.org/r/427076 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez) [08:34:04] (03PS3) 10Vgutierrez: Revert "Remove chromium from eqiad LVS name server config" chromium has been reimaged successfully. 
[puppet] - 10https://gerrit.wikimedia.org/r/427076 (https://phabricator.wikimedia.org/T187090) [08:34:28] (03CR) 10jerkins-bot: [V: 04-1] Revert "Remove chromium from eqiad LVS name server config" chromium has been reimaged successfully. [puppet] - 10https://gerrit.wikimedia.org/r/427076 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez) [08:43:19] lovely I'm hitting every use case of the commit validator [08:43:55] (03PS4) 10Vgutierrez: Revert "Remove chromium from eqiad LVS name server config" [puppet] - 10https://gerrit.wikimedia.org/r/427076 (https://phabricator.wikimedia.org/T187090) [08:44:51] lol [08:44:57] and each more pointless than the other [08:45:55] why? it's a stylel checker, if you think of the commit message as integral part of the code :-P [08:46:00] *style [08:46:51] !log vgutierrez@neodymium conftool action : set/pooled=yes; selector: name=chromium.wikimedia.org,service=pdns_recursor [08:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:09] (03CR) 10Vgutierrez: [C: 032] Revert "Remove chromium from eqiad LVS name server config" [puppet] - 10https://gerrit.wikimedia.org/r/427076 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez) [08:47:34] hashar: by any chance did you had a look at T191764 ? I'm curious about your opinion [08:47:34] T191764: CI: run tests with multiple Python3 versions - https://phabricator.wikimedia.org/T191764 [08:48:25] <_joe_> moritzm: there is no point in arguing about the usefulness of style checkers with volans; he's in love with all those auto-OCD tools [08:48:36] <_joe_> :P [08:48:39] lol [08:48:52] can't argue with that :-) [08:49:02] <_joe_> (I tried to do that multiple times, and failed to convince him) [08:51:35] 10Operations, 10Traffic, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4135444 (10Vgutierrez) 05Open>03Resolved a:03Vgutierrez [08:52:34] vgutierrez: nice! --^ [08:54:45] (03PS1) 10Elukey: profile::zookeeper::server: introduce the zookeeper349 component [puppet] - 10https://gerrit.wikimedia.org/r/427078 (https://phabricator.wikimedia.org/T182924) [08:57:53] (03PS2) 10Elukey: profile::zookeeper::server: introduce the zookeeper349 component [puppet] - 10https://gerrit.wikimedia.org/r/427078 (https://phabricator.wikimedia.org/T182924) [09:00:54] (03CR) 10Muehlenhoff: [C: 031] "One nit, but looks good." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427078 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [09:02:27] PROBLEM - HHVM rendering on mw2215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:03:03] (03PS3) 10Filippo Giunchedi: Make xenon-log line-buffered [puppet] - 10https://gerrit.wikimedia.org/r/426224 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles) [09:03:17] RECOVERY - HHVM rendering on mw2215 is OK: HTTP OK: HTTP/1.1 200 OK - 74654 bytes in 0.333 second response time [09:04:22] (03PS3) 10Elukey: profile::zookeeper::server: introduce the zookeeper349 component [puppet] - 10https://gerrit.wikimedia.org/r/427078 (https://phabricator.wikimedia.org/T182924) [09:04:40] (03PS1) 10Jcrespo: labsdb: Equalize weight among analytics replicas [puppet] - 10https://gerrit.wikimedia.org/r/427081 [09:04:44] (03CR) 10Elukey: "Thanks a lot! Going to amend the change.." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427078 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [09:04:48] (03CR) 10jerkins-bot: [V: 04-1] profile::zookeeper::server: introduce the zookeeper349 component [puppet] - 10https://gerrit.wikimedia.org/r/427078 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [09:05:14] argh double quotes [09:06:07] (03PS4) 10Elukey: profile::zookeeper::server: introduce the zookeeper349 component [puppet] - 10https://gerrit.wikimedia.org/r/427078 (https://phabricator.wikimedia.org/T182924) [09:06:48] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10943/ - no op as expected" [puppet] - 10https://gerrit.wikimedia.org/r/427078 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [09:08:44] volans: multiple pythons: in short yes. And based on Stretch that would be nice. I have no idea how to implement that thought [09:08:50] thought -> though [09:10:03] I'm not one of the many debian folks around here, so if you ask me I could say pyenv ;) [09:10:06] * volans hides [09:10:27] (03CR) 10Elukey: [C: 04-1] "Self -1 after https://gerrit.wikimedia.org/r/#/c/427078/, since now even Jessie hosts can run a zookeeper version that doesn't use init's " [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/426918 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [09:12:03] hashar ^^^ [09:12:38] (03CR) 10Filippo Giunchedi: [C: 032] Make xenon-log line-buffered [puppet] - 10https://gerrit.wikimedia.org/r/426224 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles) [09:12:44] (03PS4) 10Filippo Giunchedi: Make xenon-log line-buffered [puppet] - 10https://gerrit.wikimedia.org/r/426224 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles) [09:12:51] (03PS1) 10Pmiazga: Enable PagePreviews for 25% anon users on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427082 (https://phabricator.wikimedia.org/T191101) [09:16:33] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 5 others: Proxies information gone from Zero portal - https://phabricator.wikimedia.org/T187014#4135465 (10ema) Here's pageview hourly after deploying the changes above: {F17025595} US going down, upwards trend for India and Nige... 
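The "Request latencies" alerts above evaluate a PromQL ratio of summed request latency over request count per apiserver instance. To judge whether a spike is still ongoing, the same expression can be run by hand against the Prometheus HTTP API; the base URL below is an assumption, and the label values are re-quoted here (the Icinga output strips the quotes):
```
curl -sG 'http://prometheus.svc.eqiad.wmnet/ops/api/v1/query' \
  --data-urlencode 'query=scalar(
      sum(rate(apiserver_request_latencies_summary_sum{job="k8s-api",verb!="WATCH",verb!="WATCHLIST",instance="10.64.0.45:6443"}[5m]))
    /
      sum(rate(apiserver_request_latencies_summary_count{job="k8s-api",verb!="WATCH",verb!="WATCHLIST",instance="10.64.0.45:6443"}[5m])))' \
  | jq .   # jq is optional, only for readable JSON output
```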
[09:17:06] !log restart xenon-log on mwlog* - T169249 [09:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:12] T169249: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249 [09:25:32] volans: +1 on pyenv :] [09:30:18] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))): 15390616.485042734 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:33:48] (03PS1) 10Elukey: profile::zookeeper::server: fix typo os.version [puppet] - 10https://gerrit.wikimedia.org/r/427084 [09:35:18] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))): 80601185.40111941 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:35:19] (03CR) 10Marostegui: [C: 031] labsdb: Equalize weight among analytics replicas [puppet] - 10https://gerrit.wikimedia.org/r/427081 (owner: 10Jcrespo) [09:36:01] !log reimaging mw1266, mw1267, mw1268 (app servers) to stretch [09:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:05] (03CR) 10Elukey: [C: 032] profile::zookeeper::server: fix typo os.version [puppet] - 10https://gerrit.wikimedia.org/r/427084 (owner: 10Elukey) [09:37:58] !log reimaging mw1280, mw1281, mw1282 (API servers) to stretch [09:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:27] RECOVERY - Request latencies on chlorine is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:38:48] volans: dumped my thoughts on https://phabricator.wikimedia.org/T191764#4135484 [09:39:11] volans: but yeah if you have any experience with pyenv and setting up in a way that causes tox to recognizes it I am for it [09:39:27] volans: then the tox Docker container can be enhanced to rely on pyenv and magic will happen [09:40:17] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes2004.codfw.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes2004.codfw.wmnet}[5m]))): 209936.12004950494 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId= [09:40:18] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes2001.codfw.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes2001.codfw.wmnet}[5m]))): 214336.42340168872 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId= [09:40:27] hashar: I use it locally on 
macos without problems, and tox is happy, but I don't know if there is a more 'debianic' way of doing it [09:41:07] it's rare to find debianic and macos in the same sentence! [09:41:17] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes2004.codfw.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes2004.codfw.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:41:18] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes2001.codfw.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes2001.codfw.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:41:23] * volans wear a cloak of invisibility [09:41:36] ahahahah [09:41:37] volans: we can just pip3 install pyenv :] [09:42:54] and probably instead of just invoking "tox" [09:43:10] we would need something like pyenv init && tox [09:43:15] or whateve rthe magic command [09:43:17] !log mobrovac@tin Started deploy [restbase/deploy@e463fcf]: Use keep-alive for connections to AQS [09:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:27] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))): 160037373.22407407 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:45:57] RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 48.15 seconds [09:48:27] RECOVERY - Request latencies on argon is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:48:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [09:48:51] godog: yeah, having the full query on the alert message is quite verbose/spammy ;) [09:48:58] checking the 5xx [09:49:02] it might be aqs [09:49:07] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [09:49:38] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [09:49:51] nope, seems a spike for codfw [09:50:10] (causing issues 
for ulsfo and eqsin) [09:50:29] volans: agreed, I'll trim that down [09:51:28] ack, thanks, no hurry [09:51:51] so failed fetches for codfw https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&from=now-3h&to=now [09:52:08] ema,vgutierrez --^ [09:52:13] seems ok now [09:53:22] (03CR) 10Hoo man: [C: 04-1] "Damn old habits :S" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) (owner: 10Hoo man) [09:53:27] RECOVERY - Request latencies on chlorine is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:54:27] elukey: that looks suspiciously similar to yesterday's brief spike in esams [09:56:45] :( [09:57:22] (03PS1) 10Filippo Giunchedi: smart: normalize smartctl metric names into Prometheus names [puppet] - 10https://gerrit.wikimedia.org/r/427089 (https://phabricator.wikimedia.org/T86552) [09:58:10] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [09:58:43] (03PS3) 10Muehlenhoff: Remove Varnish config for image scaler cluster [puppet] - 10https://gerrit.wikimedia.org/r/424552 (https://phabricator.wikimedia.org/T188062) [09:59:38] (03CR) 10Muehlenhoff: [C: 032] Remove Varnish config for image scaler cluster [puppet] - 10https://gerrit.wikimedia.org/r/424552 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [10:00:30] 10Operations, 10monitoring, 10User-fgiunchedi: Trim long output from check_prometheus_metric - https://phabricator.wikimedia.org/T192343#4135519 (10fgiunchedi) [10:02:14] (03PS2) 10Filippo Giunchedi: smart: normalize smartctl metric names into Prometheus names [puppet] - 10https://gerrit.wikimedia.org/r/427089 (https://phabricator.wikimedia.org/T86552) [10:03:09] 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4135534 (10fgiunchedi) Merged the change, let's wait and see if truncation reoccurs [10:03:34] !log mobrovac@tin Finished deploy [restbase/deploy@e463fcf]: Use keep-alive for connections to AQS (duration: 20m 17s) [10:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:00] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:04:27] (03CR) 10Filippo Giunchedi: [C: 032] smart: normalize smartctl metric names into Prometheus names [puppet] - 10https://gerrit.wikimedia.org/r/427089 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [10:13:20] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [10:19:34] (03PS3) 10Muehlenhoff: Remove LVS/pybal config for image scaler cluster [puppet] - 10https://gerrit.wikimedia.org/r/424553 (https://phabricator.wikimedia.org/T188062) [10:23:25] (03PS1) 10Elukey: role::configcluster: upgrade zookeeper to 3.4.9-3~jessie in codfw [puppet] - 10https://gerrit.wikimedia.org/r/427090 (https://phabricator.wikimedia.org/T182924) [10:26:12] (03PS3) 10Arturo Borrero Gonzalez: cloudvps: add apt pinning for mitaka on jessie [puppet] - 10https://gerrit.wikimedia.org/r/426891 (https://phabricator.wikimedia.org/T192162) [10:27:28] (03CR) 10Elukey: "Pcc: https://puppet-compiler.wmflabs.org/compiler02/10946/" [puppet] - 10https://gerrit.wikimedia.org/r/427090 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [10:27:41] (03CR) 10Muehlenhoff: [C: 032] Remove LVS/pybal config for image scaler cluster [puppet] - 10https://gerrit.wikimedia.org/r/424553 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [10:28:23] godog: can I merge your smart patch along? [10:30:18] change seems safe, so I went ahead [10:32:13] PROBLEM - High CPU load on API appserver on mw1280 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:32:14] PROBLEM - High CPU load on API appserver on mw1281 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:32:31] moritzm: ugh, yes sorry about that [10:32:33] PROBLEM - Apache HTTP on mw1268 is CRITICAL: connect to address 10.64.0.63 and port 80: Connection refused [10:32:33] PROBLEM - MD RAID on mw1268 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:32:33] PROBLEM - HHVM rendering on mw1280 is CRITICAL: connect to address 10.64.0.75 and port 80: Connection refused [10:32:34] PROBLEM - HHVM rendering on mw1281 is CRITICAL: connect to address 10.64.0.76 and port 80: Connection refused [10:32:34] PROBLEM - nutcracker port on mw1280 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:32:34] PROBLEM - nutcracker port on mw1281 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:32:50] ^ reimage spam, silencing [10:36:53] PROBLEM - puppet last run on mw2244 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:38:03] PROBLEM - puppet last run on mw2245 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:39:34] !log lvs200[63]: restart pybal to apply https://gerrit.wikimedia.org/r/424553 T188062 [10:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:40] T188062: Remove imagescaler cluster (aka 'rendering') - https://phabricator.wikimedia.org/T188062 [10:39:53] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))): 20722932.302158274 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:40:33] PROBLEM - puppet last run on radon is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:40:54] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))): 13629214.507095551 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:42:53] RECOVERY - Request latencies on chlorine is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:42:54] RECOVERY - Request latencies on argon is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:43:03] PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:45:38] what's going on with those k8s alerts? [10:47:34] RECOVERY - Apache HTTP on mw1268 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time [10:48:24] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:52:35] !log lvs100[63] restart pybal to apply https://gerrit.wikimedia.org/r/424553 T188062 [10:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:41] T188062: Remove imagescaler cluster (aka 'rendering') - https://phabricator.wikimedia.org/T188062 [10:54:34] we are having since yesterday an unusually high number of accounts being created on meta-wiki [10:54:44] not autocreated, but created there [10:55:08] <_joe_> Hauskatze: this channel is public and logged; open a security ticket maybe? [10:55:23] RECOVERY - High CPU load on API appserver on mw1280 is OK: OK - load average: 3.85, 4.57, 3.27 [10:55:53] RECOVERY - nutcracker port on mw1280 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [10:55:54] _joe_: I know. Not sure if it'd warrant a ticket. I don't want to waste anyone's time. [10:56:30] <_joe_> Hauskatze: heh I can't really tell [10:57:03] (03CR) 10Giuseppe Lavagetto: [C: 031] Remove conftool configuration for image scalers [puppet] - 10https://gerrit.wikimedia.org/r/425982 (owner: 10Muehlenhoff) [10:57:23] RECOVERY - High CPU load on API appserver on mw1281 is OK: OK - load average: 3.92, 4.67, 3.35 [10:57:44] RECOVERY - MD RAID on mw1268 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [10:57:53] RECOVERY - nutcracker port on mw1281 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [10:58:41] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old. [11:00:55] _joe_: done: https://phabricator.wikimedia.org/T192350 [11:01:12] <_joe_> Hauskatze: thanks, the security people will notice it [11:01:41] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:02:07] (03PS1) 10Muehlenhoff: Switch more mediawiki servers to stretch [puppet] - 10https://gerrit.wikimedia.org/r/427091 [11:04:52] RECOVERY - HHVM rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 75242 bytes in 0.643 second response time [11:06:16] (03CR) 10Muehlenhoff: [C: 032] Switch more mediawiki servers to stretch [puppet] - 10https://gerrit.wikimedia.org/r/427091 (owner: 10Muehlenhoff) [11:06:52] RECOVERY - HHVM rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 75244 bytes in 1.507 second response time [11:09:16] (03CR) 10Hoo man: "@Addshore: What's up with this? Can we schedule this for deployment?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395486 (owner: 10Addshore) [11:11:50] (03CR) 10Addshore: "Yup, it's all ready to go afaik, I just haven't had any time to schedule it!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395486 (owner: 10Addshore) [11:13:39] (03PS1) 10Ema: Remove rendering [dns] - 10https://gerrit.wikimedia.org/r/427092 (https://phabricator.wikimedia.org/T188062) [11:15:17] (03PS1) 10Muehlenhoff: Remove discovery hiera settings for the image scalers [puppet] - 10https://gerrit.wikimedia.org/r/427093 (https://phabricator.wikimedia.org/T188062) [11:17:56] (03PS2) 10Ema: Remove imagescaler discovery entries / rendering records [dns] - 10https://gerrit.wikimedia.org/r/427092 (https://phabricator.wikimedia.org/T188062) [11:19:40] (03PS2) 10Jcrespo: labsdb: Equalize weight among analytics replicas [puppet] - 10https://gerrit.wikimedia.org/r/427081 [11:19:42] (03PS1) 10Jcrespo: mariadb: Add labsdbclient section to root.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/427094 [11:20:39] (03CR) 10Jcrespo: [C: 032] mariadb: Add labsdbclient section to root.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/427094 (owner: 10Jcrespo) [11:20:59] (03CR) 10Jcrespo: [C: 032] labsdb: Equalize weight among analytics replicas [puppet] - 10https://gerrit.wikimedia.org/r/427081 (owner: 10Jcrespo) [11:22:23] (03CR) 10Giuseppe Lavagetto: [C: 031] Remove imagescaler discovery entries / rendering records [dns] - 10https://gerrit.wikimedia.org/r/427092 (https://phabricator.wikimedia.org/T188062) (owner: 10Ema) [11:24:18] (03CR) 10Ema: [C: 032] Remove imagescaler discovery entries / rendering records [dns] - 10https://gerrit.wikimedia.org/r/427092 (https://phabricator.wikimedia.org/T188062) (owner: 10Ema) [11:26:00] (03PS2) 10Muehlenhoff: Remove discovery hiera settings for the image scalers. [puppet] - 10https://gerrit.wikimedia.org/r/427093 (https://phabricator.wikimedia.org/T188062) [11:27:59] (03CR) 10Ema: [C: 031] Remove discovery hiera settings for the image scalers. [puppet] - 10https://gerrit.wikimedia.org/r/427093 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [11:32:14] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "remove the conftool data in the patch in which you remove the conftool references." [puppet] - 10https://gerrit.wikimedia.org/r/427093 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [11:33:42] (03PS3) 10Muehlenhoff: Remove discovery hiera settings for the image scalers. 
[puppet] - 10https://gerrit.wikimedia.org/r/427093 (https://phabricator.wikimedia.org/T188062) [11:38:06] (03CR) 10Muehlenhoff: "Ack, updated in PS3 of this patch and in PS3 of https://gerrit.wikimedia.org/r/427093" [puppet] - 10https://gerrit.wikimedia.org/r/427093 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [11:41:38] (03PS3) 10Muehlenhoff: Remove conftool configuration for image scalers [puppet] - 10https://gerrit.wikimedia.org/r/425982 [11:42:04] (03CR) 10jerkins-bot: [V: 04-1] Remove conftool configuration for image scalers [puppet] - 10https://gerrit.wikimedia.org/r/425982 (owner: 10Muehlenhoff) [11:42:52] (03PS4) 10Muehlenhoff: Remove conftool configuration for image scalers. [puppet] - 10https://gerrit.wikimedia.org/r/425982 [11:43:19] (03CR) 10jerkins-bot: [V: 04-1] Remove conftool configuration for image scalers. [puppet] - 10https://gerrit.wikimedia.org/r/425982 (owner: 10Muehlenhoff) [11:45:22] (03PS5) 10Muehlenhoff: Remove conftool configuration for image scalers. [puppet] - 10https://gerrit.wikimedia.org/r/425982 (https://phabricator.wikimedia.org/T188062) [11:51:21] (03CR) 10Giuseppe Lavagetto: [C: 031] Remove discovery hiera settings for the image scalers. [puppet] - 10https://gerrit.wikimedia.org/r/427093 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [11:52:05] (03CR) 10Mobrovac: [C: 031] role::configcluster: upgrade zookeeper to 3.4.9-3~jessie in codfw [puppet] - 10https://gerrit.wikimedia.org/r/427090 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [11:52:07] (03PS4) 10Muehlenhoff: Remove discovery hiera settings for the image scalers. [puppet] - 10https://gerrit.wikimedia.org/r/427093 (https://phabricator.wikimedia.org/T188062) [11:52:31] (03Abandoned) 10Mobrovac: Revert switching the ChangeNotification job. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426911 (https://phabricator.wikimedia.org/T192198) (owner: 10Ppchelko) [11:52:49] (03CR) 10Muehlenhoff: [C: 032] Remove discovery hiera settings for the image scalers. [puppet] - 10https://gerrit.wikimedia.org/r/427093 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [11:56:45] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:00:05] mobrovac: Your horoscope predicts another unfortunate Page Previews roll-out to enwiki deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180417T1200). [12:00:05] raynor: A patch you scheduled for Page Previews roll-out to enwiki is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:11] ok let's go [12:00:13] * mobrovac taking over tin [12:01:24] (03CR) 10Mobrovac: [C: 032] Enable PagePreviews for 25% anon users on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427082 (https://phabricator.wikimedia.org/T191101) (owner: 10Pmiazga) [12:01:34] (03PS1) 10Muehlenhoff: Switch former image scalers to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/427096 (https://phabricator.wikimedia.org/T188062) [12:02:15] (03Merged) 10jenkins-bot: Enable PagePreviews for 25% anon users on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427082 (https://phabricator.wikimedia.org/T191101) (owner: 10Pmiazga) [12:03:15] starting sync now ryasmeen [12:03:20] ok [12:03:22] oops, I meant raynor [12:03:25] :P [12:03:44] the more the merrier [12:03:46] :) [12:03:48] :) [12:03:57] (03PS1) 10Muehlenhoff: Remove Prometheus config for image scalers [puppet] - 10https://gerrit.wikimedia.org/r/427097 (https://phabricator.wikimedia.org/T188062) [12:04:31] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Enable page previews for 25% of anons on enwiki - T191101 (duration: 01m 03s) [12:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:38] ok sync done [12:04:39] T191101: Let's do this: Rollout page previews to 100% of anons on English Wikipedia - https://phabricator.wikimedia.org/T191101 [12:06:15] mwdebug1002 ? mobrovac ? [12:06:39] raynor: all of it, there was no error rate increase [12:07:10] mobrovac: ok, we'll start testing [12:09:35] mobrovac, on debug servers the rate is still 0.1, I checked a couple of production hosts [12:09:55] k [12:10:38] RECOVERY - puppet last run on radon is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:10:54] you said that first it gets deployed only to a small subset of servers [12:11:05] and then when there are no errors it gets deployed to all, right? [12:13:38] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:16:35] yes, that's done automatically by the sync tool [12:21:15] (03CR) 10Giuseppe Lavagetto: [C: 031] "You might want to remove the hiera configuration for role::mediawiki::imagescaler as well, either here or in another patch is ok" [puppet] - 10https://gerrit.wikimedia.org/r/427096 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [12:22:25] (03CR) 10Giuseppe Lavagetto: [C: 031] Remove conftool configuration for image scalers. [puppet] - 10https://gerrit.wikimedia.org/r/425982 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [12:25:18] Mh, at least one EventLogging Processor has had a connection error with Kafka [12:25:26] but it's back on track [12:25:51] mobrovac: do you know how much time it takes to push the config change to all servers? [12:25:57] I'm still getting only 10% [12:26:12] raynor: it's been done [12:26:29] raynor: when i told you it was synced, that's when the value got deployed everywhere [12:26:34] hm, still getting 10%? [12:26:37] hm mh [12:26:47] mforns: lemme know if you need help [12:26:54] elukey, sure :] [12:27:15] raynor: getting 10% consistently or sporadically?
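The check raynor describes here, comparing what the mwdebug canary serves against a regular production appserver, can be scripted roughly as follows. This is only an illustrative sketch: the X-Wikimedia-Debug header value and the config marker being grepped for are assumptions for the example, not the exact procedure used in the channel.

```
# Illustrative sketch only: fetch the same page once pinned to a debug backend
# via the X-Wikimedia-Debug header and once normally, then compare the sampled
# group-size value found in the response. The header value and the MARKER
# regex are assumptions made for this example.
import re
import requests

URL = "https://en.wikipedia.org/wiki/Special:BlankPage"
MARKER = re.compile(r'"wgPopupsAnonsExperimentalGroupSize":\s*([0-9.]+)')  # hypothetical marker

def sampled_rate(headers=None):
    """Return the group-size value seen in the served page, or None."""
    html = requests.get(URL, headers=headers or {}, timeout=10).text
    match = MARKER.search(html)
    return float(match.group(1)) if match else None

debug = sampled_rate({"X-Wikimedia-Debug": "backend=mwdebug1002.eqiad.wmnet"})
prod = sampled_rate()
print(f"mwdebug1002: {debug}  production: {prod}")
if debug != prod:
    print("canary and production disagree - the sync may still be propagating")
```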
[12:27:20] consistently [12:27:26] on debug and prod servers [12:27:29] Pchelolo: mobrovac: whenever the PagePreviews change is done, feel free to proceed with the SWAT one ( [EventBus][no-op] Support per-event-type EventBus enabling configuration https://gerrit.wikimedia.org/r/#/c/425888/ ) [12:27:53] Pchelolo: mobrovac: no need to wait exactly 13:00 UTC to roll it I guess. As long as the current PagePreviews deployment is completed [12:28:09] raynor: how is the value propagated to clients? it might be cached somewhere? [12:28:23] kk hashar, will do, thnx [12:28:28] it's a config change, usually it goes pretty fast [12:30:11] oh damn [12:30:20] sorry raynor my bad, i'll sync again [12:31:12] np [12:31:44] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Enable page previews for 25% of anons on enwiki, take #2 - T191101 (duration: 00m 58s) [12:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:50] T191101: Let's do this: Rollout page previews to 100% of anons on English Wikipedia - https://phabricator.wikimedia.org/T191101 [12:31:59] ok done now olliv, raynor [12:32:17] yup, I see that [12:32:19] thx [12:32:24] (03PS1) 10Jcrespo: mariadb: Depool db1067 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427100 (https://phabricator.wikimedia.org/T192358) [12:33:37] (03CR) 10jenkins-bot: Enable PagePreviews for 25% anon users on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427082 (https://phabricator.wikimedia.org/T191101) (owner: 10Pmiazga) [12:34:46] I see events coming in :] [12:39:33] (03PS1) 10Vgutierrez: ntp: Cleanup jessie only code [puppet] - 10https://gerrit.wikimedia.org/r/427101 (https://phabricator.wikimedia.org/T187090) [12:40:49] mobrovac, looks good, we did some smoke tests and it works properly [12:41:00] we will keep testing but so far it's good, everything works \o/ [12:41:17] I'll create a patch for 50% [12:45:24] On the EventLogging side, looks good, there's an expected increase in CPU usage, also I've seen a couple of validation errors, due to very long payloads (probably page titles with unicode characters) [12:45:33] but they are very few, expected [12:46:11] actually, those errors do not belong to enwiki [12:46:19] gr8, on the services side all good as well [12:47:12] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 902580.5784543327 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:47:22] akosiaris: ^ [12:47:37] akosiaris: also, saw an error in rb about mathoid timing out [12:48:03] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))): 519171.06398809515 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId= [12:48:05] yeah, known, I've just switched over mathoid from "fantastic-condor" to "production" [12:48:20] this is just the kubelets complaining they had work to do [12:48:24] we need to tune those checks [12:48:37] (03PS2) 10Vgutierrez: ntp: Cleanup jessie only code
[puppet] - 10https://gerrit.wikimedia.org/r/427101 (https://phabricator.wikimedia.org/T187090) [12:48:38] and the timing out was expected as well [12:48:44] but it should be ok now [12:49:02] PROBLEM - puppet last run on mw2151 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:49:12] PROBLEM - puppet last run on mw2150 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:49:12] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:49:33] PROBLEM - puppet last run on mw1298 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:51:12] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:52:32] 10Operations, 10Beta-Cluster-Infrastructure, 10Mobile-Content-Service, 10Page-Previews, and 3 others: [Bug] Beta cluster page summary endpoint sometimes reponds with 5xx - https://phabricator.wikimedia.org/T192287#4135923 (10Pchelolo) 05Open>03Resolved The deployment-mediawiki04.deployment-prep.eqiad.w... [12:55:10] (03PS1) 10Pmiazga: Enable PagePreviews for 50% anon users on en wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427103 (https://phabricator.wikimedia.org/T191101) [12:57:12] jouncebot: next [12:57:12] In 0 hour(s) and 2 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180417T1300) [12:57:26] mobrovac: o/ - is it a super busy time or can I upgrade zookeeper in main-codfw? (it would require a quick check from your side that nothing is on fire) [12:57:34] just tested the procedure in labs [12:58:24] yup, elukey, you can go ahead now [12:58:33] elukey: how long do you suspect it would take? [12:59:18] mobrovac: I think 10/15 mins top [12:59:30] kk that work [12:59:31] s [12:59:52] super [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 8 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180417T1300). [13:00:04] Pchelolo: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:20] !log upgrade zookeeper on conf200[123] to 3.4.9~jessie - T182924 [13:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:26] T182924: Refresh zookeeper nodes in eqiad - https://phabricator.wikimedia.org/T182924 [13:00:36] (03PS2) 10Elukey: role::configcluster: upgrade zookeeper to 3.4.9-3~jessie in codfw [puppet] - 10https://gerrit.wikimedia.org/r/427090 (https://phabricator.wikimedia.org/T182924) [13:01:39] mobrovac, olliv, raynor: I thought 25% enwiki anonymous would signify a higher increase.. I can barely see a change in: https://tinyurl.com/ya32exsa (note y-axis is not based at 0) [13:01:52] (03PS1) 10Marostegui: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427104 (https://phabricator.wikimedia.org/T191996) [13:02:38] (03CR) 10Elukey: [C: 032] role::configcluster: upgrade zookeeper to 3.4.9-3~jessie in codfw [puppet] - 10https://gerrit.wikimedia.org/r/427090 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [13:03:46] mforns: i see a slight increase in restbase rates for the summary end point in https://grafana-admin.wikimedia.org/dashboard/db/restbase?panelId=15&fullscreen&orgId=1&from=now-3h&to=now (since 12:30) [13:03:52] but those are uncached reqs [13:03:58] so cached reqs should be much higher [13:04:06] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427104 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui) [13:05:17] mobrovac: I am proceeding with conf2001 [13:05:22] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427104 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui) [13:05:24] kk elukey [13:06:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1114 to get it ready for a network cable change (duration: 00m 58s) [13:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:55] mobrovac: upgraded, all good from my side (this one is a follower) [13:07:36] checking the graphs and logs [13:09:16] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427104 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui) [13:09:43] mobrovac: proceeding with 2002 ok? 
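The order being followed above, upgrading the follower nodes first and checking whether the leader moves, can be verified from the outside with ZooKeeper's four-letter admin commands. A minimal sketch, assuming the standard client port 2181 is reachable and the srvr command is not disabled; the fully qualified host names are assumed from the conf200[123] codfw hosts mentioned in the log:

```
# Minimal sketch: ask each conf200x node whether it is currently a leader or a
# follower by sending ZooKeeper's "srvr" four-letter command to the client port.
# Host names and port are assumptions based on the upgrade discussed above.
import socket

HOSTS = ["conf2001.codfw.wmnet", "conf2002.codfw.wmnet", "conf2003.codfw.wmnet"]

def zk_mode(host, port=2181, timeout=5):
    """Return the 'Mode:' line (leader/follower) reported by the srvr command."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr\n")
        chunks = []
        while True:
            data = sock.recv(1024)
            if not data:
                break
            chunks.append(data)
    reply = b"".join(chunks).decode("utf-8", "replace")
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return "unknown"

for host in HOSTS:
    print(f"{host}: {zk_mode(host)}")
```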
[13:09:47] (another follower) [13:10:09] elukey: yup, all good from our side [13:10:44] (03PS2) 10Jcrespo: mariadb: Depool db1067 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427100 (https://phabricator.wikimedia.org/T192358) [13:12:31] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1067 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427100 (https://phabricator.wikimedia.org/T192358) (owner: 10Jcrespo) [13:13:41] (03Merged) 10jenkins-bot: mariadb: Depool db1067 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427100 (https://phabricator.wikimedia.org/T192358) (owner: 10Jcrespo) [13:13:41] mobrovac: I am checking metrics in https://grafana.wikimedia.org/dashboard/db/zookeeper, latency for 2001 went up but not sure if temporary or not [13:13:50] waiting a sec just in case [13:13:52] k [13:14:45] (03CR) 10jenkins-bot: mariadb: Depool db1067 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427100 (https://phabricator.wikimedia.org/T192358) (owner: 10Jcrespo) [13:14:53] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))): 104353253.29130433 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:15:12] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))): 87920176.1306471 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:15:42] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1067 (duration: 00m 58s) [13:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:10] 2002 just upgraded [13:17:53] RECOVERY - Request latencies on chlorine is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:18:12] RECOVERY - Request latencies on argon is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:18:16] everything seems good [13:19:00] (03CR) 10Muehlenhoff: [C: 032] Switch former image scalers to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/427096 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [13:19:05] (03PS2) 10Muehlenhoff: Switch former image scalers to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/427096 (https://phabricator.wikimedia.org/T188062) [13:20:58] (03PS1) 10Alexandros Kosiaris: Add a new helm puppet module [puppet] - 10https://gerrit.wikimedia.org/r/427109 [13:23:18] mobrovac: all done! 
leader migrated from 2003 to 2002, the cluster seems working fine with the new version [13:23:32] RECOVERY - puppet last run on mw2245 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:23:50] (03CR) 10Ottomata: "Hm, but, profile::kafka::mirror::alerts has this defaulting to "http://prometheus.svc.${::site}.wmnet/ops"" [puppet] - 10https://gerrit.wikimedia.org/r/427065 (owner: 10Elukey) [13:24:10] elukey: yay, all looking good on our side as well :) [13:24:47] (03CR) 10Ottomata: "Oh, I see future patch, ok..." [puppet] - 10https://gerrit.wikimedia.org/r/427065 (owner: 10Elukey) [13:25:03] (03CR) 10Elukey: [C: 032] "> Hm, but, profile::kafka::mirror::alerts has this defaulting to" [puppet] - 10https://gerrit.wikimedia.org/r/427065 (owner: 10Elukey) [13:25:41] !log completed migration of zookeeper on conf200[123] [13:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:51] mobrovac: thanks for the support! Lemme know if you see anything weird [13:26:07] thnx elukey! [13:26:30] !log updating puppet compiler facts [13:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:40] volans: <3 beers++ [13:26:52] RECOVERY - puppet last run on mw2244 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:26:54] mobrovac: I'll wait a couple of days to see if anything comes up, then I'll upgrade main-eqiad too [13:27:04] sounds good [13:29:02] RECOVERY - puppet last run on mw2151 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [13:29:13] RECOVERY - puppet last run on mw2150 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:30:00] !log starting backup from db1067, may generate some lag [13:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:27] !log removed role::mediawiki::imagescaler from deployment-prep, per watroles the only use of that role in WMCS [13:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:04] !log removed role::mediawiki::imagescaler from deployment-mediawiki05, per watroles the only use of that role in WMCS [13:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:42] RECOVERY - puppet last run on mw1298 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:36:52] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler02/10950/" [puppet] - 10https://gerrit.wikimedia.org/r/427101 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez) [13:36:52] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:37:12] PROBLEM - Request latencies on acrux is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.0.93:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.0.93:6443}[5m]))): 34183089.84578885 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:39:27] (03PS2) 10Alexandros Kosiaris: Add a new helm puppet module [puppet] - 10https://gerrit.wikimedia.org/r/427109 [13:39:35] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add a new helm puppet module [puppet] - 10https://gerrit.wikimedia.org/r/427109 (owner: 10Alexandros Kosiaris) [13:39:52] PROBLEM - Request latencies on acrab is CRITICAL: CRITICAL - 
scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.16.26:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.16.26:6443}[5m]))): 63164634.67105261 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:41:13] RECOVERY - Request latencies on acrux is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.0.93:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.0.93:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:41:30] (03PS1) 10Ottomata: Allow rsyncing to dumps pagecounts-ez and media from dumps peer hosts [puppet] - 10https://gerrit.wikimedia.org/r/427111 (https://phabricator.wikimedia.org/T189283) [13:42:29] (03PS1) 10Alexandros Kosiaris: helm: Fix call to helm init [puppet] - 10https://gerrit.wikimedia.org/r/427113 [13:42:50] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] helm: Fix call to helm init [puppet] - 10https://gerrit.wikimedia.org/r/427113 (owner: 10Alexandros Kosiaris) [13:42:52] !log mobrovac@tin Synchronized php-1.31.0-wmf.29/extensions/EventBus/extension.json: Support per-event dispatch of events, file 1/3 - T191464 (duration: 03m 07s) [13:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:58] T191464: Enable CP4JQ support for private wikis - https://phabricator.wikimedia.org/T191464 [13:43:47] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10951/labstore1007.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/427111 (https://phabricator.wikimedia.org/T189283) (owner: 10Ottomata) [13:43:53] (03PS2) 10Ottomata: Allow rsyncing to dumps pagecounts-ez and media from dumps peer hosts [puppet] - 10https://gerrit.wikimedia.org/r/427111 (https://phabricator.wikimedia.org/T189283) [13:43:57] (03CR) 10Ottomata: [V: 032 C: 032] Allow rsyncing to dumps pagecounts-ez and media from dumps peer hosts [puppet] - 10https://gerrit.wikimedia.org/r/427111 (https://phabricator.wikimedia.org/T189283) (owner: 10Ottomata) [13:44:36] (03PS1) 10Jcrespo: mariadb: Add dbstore_multiinstance profile to dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/427114 (https://phabricator.wikimedia.org/T192358) [13:46:12] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): Exec[helm-init] [13:47:53] RECOVERY - Request latencies on acrab is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.16.26:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.16.26:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:48:48] (03PS1) 10Filippo Giunchedi: nagios: more understandable output for check_prometheus_metric [puppet] - 10https://gerrit.wikimedia.org/r/427115 (https://phabricator.wikimedia.org/T192343) [13:49:10] olliv: we'll need to start slightly later for the next round, some mw servers need some changes applied [13:49:15] we are still on track, just a bit late [13:49:27] mobrovac: sounds good, just let us know when [13:49:30] (03PS2) 10Jcrespo: mariadb: Add dbstore_multiinstance profile to dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/427114 (https://phabricator.wikimedia.org/T192358) [13:49:41] (03PS1) 10Alexandros Kosiaris: helm: Remove dependency on helm-init for /etc/helm [puppet] - 10https://gerrit.wikimedia.org/r/427116 [13:49:52] will do olliv [13:50:35] (03CR) 10Alexandros Kosiaris: [C: 032] helm: Remove dependency on helm-init for /etc/helm [puppet] - 10https://gerrit.wikimedia.org/r/427116 (owner: 10Alexandros Kosiaris) [13:51:19] (03PS3) 10Jcrespo: mariadb: Add dbstore_multiinstance profile to dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/427114 (https://phabricator.wikimedia.org/T192358) [13:52:15] (03PS1) 10Muehlenhoff: Make mw1269 a scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/427117 [13:52:50] (03CR) 10Filippo Giunchedi: "example output from one of the long queries we have now:" [puppet] - 10https://gerrit.wikimedia.org/r/427115 (https://phabricator.wikimedia.org/T192343) (owner: 10Filippo Giunchedi) [13:54:22] (03PS8) 10Ema: VCL: improve handling of uncacheable responses [puppet] - 10https://gerrit.wikimedia.org/r/421542 (https://phabricator.wikimedia.org/T180712) [13:54:27] (03PS3) 10Ppchelko: Enable EventBus for job events for all but wikitech wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425601 (https://phabricator.wikimedia.org/T191464) [13:54:39] (03CR) 10Filippo Giunchedi: [C: 031] Remove Prometheus config for image scalers [puppet] - 10https://gerrit.wikimedia.org/r/427097 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [13:56:28] (03PS4) 10Ppchelko: Enable EventBus for job events for all but wikitech wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425601 (https://phabricator.wikimedia.org/T191464) [13:57:35] (03PS1) 10Alexandros Kosiaris: helm: Add write bit for user on HELM_HOME [puppet] - 10https://gerrit.wikimedia.org/r/427118 [13:58:06] (03CR) 10Giuseppe Lavagetto: [C: 031] Make mw1269 a scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/427117 (owner: 10Muehlenhoff) [13:58:19] (03CR) 10Alexandros Kosiaris: [C: 032] helm: Add write bit for user on HELM_HOME [puppet] - 10https://gerrit.wikimedia.org/r/427118 (owner: 10Alexandros Kosiaris) [13:59:08] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))): 20965221.713246 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api 
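The "Request latencies" checks flapping above evaluate a mean over the last five minutes, the rate of the summary's _sum divided by the rate of its _count, and compare it to a fixed threshold in microseconds, which is why a short burst of slow apiserver calls trips them and they recover a few minutes later (and why the check_prometheus_metric output improvements are being reviewed here). A rough sketch of the same evaluation against the Prometheus HTTP API; the Prometheus base URL below is an assumption for illustration, while the query itself is the one shown in the alert:

```
# Rough sketch of what the Icinga check evaluates: mean apiserver request
# latency over 5 minutes (rate of the summary _sum divided by rate of the
# summary _count), compared against a fixed threshold in microseconds.
# The Prometheus base URL is an assumption for illustration.
import requests

PROMETHEUS = "http://prometheus.svc.eqiad.wmnet/k8s"  # assumed endpoint
THRESHOLD_US = 100000.0  # threshold used by the alert above, in microseconds

QUERY = (
    'scalar('
    ' sum(rate(apiserver_request_latencies_summary_sum{'
    'job="k8s-api",verb!="WATCH",verb!="WATCHLIST",instance="10.64.0.45:6443"}[5m]))'
    ' / sum(rate(apiserver_request_latencies_summary_count{'
    'job="k8s-api",verb!="WATCH",verb!="WATCHLIST",instance="10.64.0.45:6443"}[5m])))'
)

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
value = float(resp.json()["data"]["result"][1])  # scalar results come back as [timestamp, value]
state = "CRITICAL" if value >= THRESHOLD_US else "OK"
print(f"{state}: mean apiserver request latency = {value:.0f}us (threshold {THRESHOLD_US:.0f}us)")
```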
[13:59:47] PROBLEM - puppet last run on naos is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[helm-init] [14:00:04] mobrovac: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Page Previews roll-out to enwiki. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180417T1400). [14:00:05] raynor: A patch you scheduled for Page Previews roll-out to enwiki is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:08] RECOVERY - Request latencies on chlorine is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:00:43] (03CR) 10Volans: "LGTM in general, one usability comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427115 (https://phabricator.wikimedia.org/T192343) (owner: 10Filippo Giunchedi) [14:01:07] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [14:01:20] (03PS1) 10Vgutierrez: install_server: Reimage lvs3004 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/427119 (https://phabricator.wikimedia.org/T191897) [14:01:41] (03PS4) 10Jcrespo: mariadb: Add dbstore_multiinstance profile to dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/427114 (https://phabricator.wikimedia.org/T192358) [14:01:51] o/ [14:02:02] (03PS2) 10Muehlenhoff: Make mw1269 a scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/427117 [14:02:18] PROBLEM - High CPU load on API appserver on mw1282 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:02:25] ^ silencing [14:02:28] PROBLEM - puppet last run on deploy1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[helm-init] [14:02:37] (03PS5) 10Jcrespo: mariadb: Add dbstore_multiinstance profile to dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/427114 (https://phabricator.wikimedia.org/T192358) [14:02:37] PROBLEM - Apache HTTP on mw1266 is CRITICAL: connect to address 10.64.0.61 and port 80: Connection refused [14:02:37] PROBLEM - Nginx local proxy to apache on mw1268 is CRITICAL: connect to address 10.64.0.63 and port 443: Connection refused [14:02:37] PROBLEM - MD RAID on mw1266 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:02:37] PROBLEM - Check size of conntrack table on mw1267 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:02:38] PROBLEM - Check systemd state on mw1268 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:02:38] PROBLEM - HHVM rendering on mw1282 is CRITICAL: connect to address 10.64.0.77 and port 80: Connection refused [14:02:38] PROBLEM - Check size of conntrack table on mw1280 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:02:39] PROBLEM - MD RAID on mw1280 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:02:39] PROBLEM - nutcracker port on mw1282 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[14:02:43] raynor: we're running late on this one, should start in 10 to 15 minutes [14:03:03] (03CR) 10Giuseppe Lavagetto: Puppet: add ping_offload role and profile (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/424151 (https://phabricator.wikimedia.org/T190090) (owner: 10Ayounsi) [14:03:28] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[helm-init] [14:03:29] sure, np [14:03:38] sorry for pinging in services, I clicked wrong channel ;/ [14:03:44] no worries [14:03:56] (03CR) 10Muehlenhoff: [C: 032] Make mw1269 a scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/427117 (owner: 10Muehlenhoff) [14:04:18] PROBLEM - Check systemd state on mw1280 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:04:37] RECOVERY - Apache HTTP on mw1266 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time [14:06:07] PROBLEM - Nginx local proxy to apache on mw1280 is CRITICAL: connect to address 10.64.0.75 and port 443: Connection refused [14:06:07] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1280 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:06:48] (03PS1) 10Ottomata: Rename mirror::alerts source prometheus url paramater [puppet] - 10https://gerrit.wikimedia.org/r/427120 (https://phabricator.wikimedia.org/T190940) [14:07:17] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[helm-init] [14:07:22] (03CR) 10jerkins-bot: [V: 04-1] Rename mirror::alerts source prometheus url paramater [puppet] - 10https://gerrit.wikimedia.org/r/427120 (https://phabricator.wikimedia.org/T190940) (owner: 10Ottomata) [14:07:47] PROBLEM - Check whether ferm is active by checking the default input chain on mw1280 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[14:08:03] (03PS2) 10Ottomata: Rename mirror::alerts source prometheus url paramater [puppet] - 10https://gerrit.wikimedia.org/r/427120 (https://phabricator.wikimedia.org/T190940) [14:08:04] !log Depool and reimage lvs3004 as stretch - T191897 [14:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:11] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [14:08:19] (03PS1) 10Alexandros Kosiaris: mathoid: Refresh deployment on config changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/427121 [14:09:06] (03PS3) 10Ottomata: Rename mirror::alerts source prometheus url paramater [puppet] - 10https://gerrit.wikimedia.org/r/427120 (https://phabricator.wikimedia.org/T190940) [14:11:17] (03PS4) 10Ottomata: Rename mirror::alerts source prometheus url paramater [puppet] - 10https://gerrit.wikimedia.org/r/427120 (https://phabricator.wikimedia.org/T190940) [14:12:27] (03PS2) 10Vgutierrez: install_server: Reimage lvs3004 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/427119 (https://phabricator.wikimedia.org/T191897) [14:13:11] (03CR) 10Vgutierrez: [C: 032] install_server: Reimage lvs3004 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/427119 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [14:14:53] (03PS5) 10Ottomata: Rename mirror::alerts source prometheus url paramater [puppet] - 10https://gerrit.wikimedia.org/r/427120 (https://phabricator.wikimedia.org/T190940) [14:15:16] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler02/10952/dbstore1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/427114 (https://phabricator.wikimedia.org/T192358) (owner: 10Jcrespo) [14:15:24] (03PS6) 10Jcrespo: mariadb: Add dbstore_multiinstance profile to dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/427114 (https://phabricator.wikimedia.org/T192358) [14:15:28] (03CR) 10Ottomata: [V: 032 C: 032] Rename mirror::alerts source prometheus url paramater [puppet] - 10https://gerrit.wikimedia.org/r/427120 (https://phabricator.wikimedia.org/T190940) (owner: 10Ottomata) [14:16:11] !log mobrovac@tin Synchronized php-1.31.0-wmf.29/extensions/EventBus/extension.json: Support per-event dispatch of events, file 1/3 - T191464 (duration: 03m 00s) [14:16:14] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4136099 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs3004.esams.wmnet ``` The log can be found in `/var/lo... [14:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:17] T191464: Enable CP4JQ support for private wikis - https://phabricator.wikimedia.org/T191464 [14:18:28] RECOVERY - High CPU load on API appserver on mw1282 is OK: OK - load average: 7.80, 5.86, 3.68 [14:18:47] RECOVERY - MD RAID on mw1266 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [14:18:47] RECOVERY - Check size of conntrack table on mw1267 is OK: OK: nf_conntrack is 0 % full [14:18:47] RECOVERY - Check size of conntrack table on mw1280 is OK: OK: nf_conntrack is 0 % full [14:18:48] RECOVERY - Check whether ferm is active by checking the default input chain on mw1280 is OK: OK ferm input default policy is set [14:18:48] RECOVERY - MD RAID on mw1280 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [14:19:24] (03PS6) 10Muehlenhoff: Remove conftool configuration for image scalers. 
[puppet] - 10https://gerrit.wikimedia.org/r/425982 (https://phabricator.wikimedia.org/T188062) [14:19:58] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4136107 (10Vgutierrez) [14:21:25] (03CR) 10Muehlenhoff: [C: 032] Remove conftool configuration for image scalers. [puppet] - 10https://gerrit.wikimedia.org/r/425982 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [14:21:51] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4120349 (10Vgutierrez) [14:22:01] !log mobrovac@tin Synchronized php-1.31.0-wmf.29/extensions/EventBus/includes/EventBus.php: Support per-event dispatch of events, file 2/3 - T191464 (duration: 03m 06s) [14:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:07] T191464: Enable CP4JQ support for private wikis - https://phabricator.wikimedia.org/T191464 [14:22:48] (03PS2) 10Mobrovac: Enable PagePreviews for 50% anon users on en wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427103 (https://phabricator.wikimedia.org/T191101) (owner: 10Pmiazga) [14:23:34] !log start es1017 reimage [14:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:19] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4136122 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs2001.codfw.wmnet']... [14:25:38] !log mobrovac@tin Synchronized php-1.31.0-wmf.29/extensions/EventBus/includes/JobQueueEventBus.php: Support per-event dispatch of events, file 3/3 - T191464 (duration: 03m 07s) [14:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:59] RECOVERY - Check systemd state on mw1268 is OK: OK - running: The system is fully operational [14:25:59] RECOVERY - Nginx local proxy to apache on mw1268 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.056 second response time [14:26:08] RECOVERY - nutcracker port on mw1282 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [14:26:23] (03CR) 10Mobrovac: [C: 032] Enable PagePreviews for 50% anon users on en wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427103 (https://phabricator.wikimedia.org/T191101) (owner: 10Pmiazga) [14:26:39] RECOVERY - Check systemd state on mw1280 is OK: OK - running: The system is fully operational [14:27:08] RECOVERY - HHVM rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 75325 bytes in 8.323 second response time [14:27:17] (03PS7) 10Jcrespo: mariadb: Add dbstore_multiinstance profile to dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/427114 (https://phabricator.wikimedia.org/T192358) [14:27:19] RECOVERY - Nginx local proxy to apache on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.060 second response time [14:27:37] (03Merged) 10jenkins-bot: Enable PagePreviews for 50% anon users on en wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427103 (https://phabricator.wikimedia.org/T191101) (owner: 10Pmiazga) [14:28:50] raynor, olliv: starting sync for 50% ^ [14:29:23] sounds good [14:29:23] (03CR) 10jenkins-bot: Enable PagePreviews for 50% anon users on en wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427103 (https://phabricator.wikimedia.org/T191101) (owner: 10Pmiazga) 
[14:29:34] (03PS1) 10Jcrespo: mariadb-auto_install: Disable back the reimage for all db hosts [puppet] - 10https://gerrit.wikimedia.org/r/427126 [14:29:38] ok [14:29:51] (03CR) 10Jcrespo: [C: 032] mariadb: Add dbstore_multiinstance profile to dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/427114 (https://phabricator.wikimedia.org/T192358) (owner: 10Jcrespo) [14:29:53] (03PS1) 10ArielGlenn: log more xmlstream exceptions, process all input files before exit [dumps] - 10https://gerrit.wikimedia.org/r/427127 (https://phabricator.wikimedia.org/T191177) [14:30:01] (03PS2) 10Jcrespo: mariadb-auto_install: Disable back the reimage for all db hosts [puppet] - 10https://gerrit.wikimedia.org/r/427126 [14:30:43] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Enable page previews for 50% of anons for enwiki - T191101 (duration: 00m 58s) [14:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:49] T191101: Let's do this: Rollout page previews to 100% of anons on English Wikipedia - https://phabricator.wikimedia.org/T191101 [14:30:54] raynor: olliv: done ^ [14:30:59] you can test [14:31:03] on it [14:31:08] (03CR) 10Jcrespo: [C: 032] mariadb-auto_install: Disable back the reimage for all db hosts [puppet] - 10https://gerrit.wikimedia.org/r/427126 (owner: 10Jcrespo) [14:32:55] (03PS1) 10Jcrespo: Revert "mariadb: Depool es1017 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427128 [14:34:52] (03CR) 10ArielGlenn: [C: 032] log more xmlstream exceptions, process all input files before exit [dumps] - 10https://gerrit.wikimedia.org/r/427127 (https://phabricator.wikimedia.org/T191177) (owner: 10ArielGlenn) [14:35:37] (03PS1) 10Muehlenhoff: Remove more hiera entries related to the new deprecated image scalers [puppet] - 10https://gerrit.wikimedia.org/r/427129 (https://phabricator.wikimedia.org/T188062) [14:36:06] !log ariel@tin Started deploy [dumps/dumps@1073d75]: more exception logging from xmlstream [14:36:06] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1280 is OK: OK: synced at Tue 2018-04-17 14:36:00 UTC. 
[14:36:09] !log ariel@tin Finished deploy [dumps/dumps@1073d75]: more exception logging from xmlstream (duration: 00m 03s) [14:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:27] (03PS2) 10Jcrespo: Revert "mariadb: Depool es1017 for reimage" but pool it slowly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427128 [14:37:16] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:37:48] mobrovac: tested, all looks good [14:38:57] (03PS2) 10Muehlenhoff: Remove Prometheus config for image scalers [puppet] - 10https://gerrit.wikimedia.org/r/427097 (https://phabricator.wikimedia.org/T188062) [14:39:56] (03CR) 10Muehlenhoff: [C: 032] Remove Prometheus config for image scalers [puppet] - 10https://gerrit.wikimedia.org/r/427097 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [14:40:04] (03CR) 10Filippo Giunchedi: nagios: more understandable output for check_prometheus_metric (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427115 (https://phabricator.wikimedia.org/T192343) (owner: 10Filippo Giunchedi) [14:41:46] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4136190 (10Marostegui) Cable has been replaced by @Cmjohnson just now. [14:42:43] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136191 (10Papaul) switch port information when ready to move db2042. db2042 was on asw-c6-codfw ge-6/0/9 and now will be on asw-d3-codfw ge-3/0/ 10 new ip address will be :... [14:43:22] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1067 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427131 [14:43:32] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136194 (10Marostegui) @ayounsi can you configure asw-d3-codfw ge-3/0/ 10 for us? We want to move db2042 to that port Thanks! [14:43:45] ok olliv, if you feel comfortable with it, we can go to 75% in 15 mins, or we can postpone it a bit since we were running late on this one [14:43:56] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427132 [14:44:05] I will go after you, jynus [14:44:14] mobrovac: how long does it take for the graphs to update? [14:44:25] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427132 [14:44:40] mforns: ^ [14:44:54] olliv: most of them are live data [14:45:33] mobrovac: how about at 5:15? [14:45:38] hey mobrovac, yes, it takes 1-2 minutes cc olliv [14:45:54] olliv: sounds good to me [14:47:17] mobrovac: hello [14:48:00] good morning nuria_ [14:48:35] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4136205 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['wdqs2001.codfw.wmnet'] ``` and were **ALL** successful. [14:49:10] mobrovac: one question [14:49:19] mobrovac: before we move to 75% [14:49:30] mobrovac: how long is the caching ttl for these settings? 
[14:49:38] (03PS1) 10Muehlenhoff: Remove image scaler Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/427133 (https://phabricator.wikimedia.org/T188062) [14:49:55] mobrovac: how do we know everyone that had the feature available at 25% actually is able to "exercise" it? [14:50:01] nuria_: you mean for the actual percentage? none, afaik [14:50:27] mobrovac: looking at the graphs, throughput has not changed much at all [14:50:31] raynor may correct me, but once you are bucketed, that's it [14:50:39] mobrovac: which is a bit suspicious [14:50:43] (03PS1) 10Ottomata: Reduce warning_threshold for main eqiad -> codfw MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/427134 (https://phabricator.wikimedia.org/T190940) [14:50:48] mobrovac: ya, but bucketing happens per session [14:50:56] mobrovac: so if you already had a session opened [14:51:08] right, the new percentage won't affect you [14:51:17] mobrovac: it does not matter whether the feature is enabled [14:51:44] mobrovac: thus we are not seeing people using it maybe because their sessions were already in place when we enabled the feature [14:51:52] (03CR) 10Ottomata: [C: 032] Reduce warning_threshold for main eqiad -> codfw MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/427134 (https://phabricator.wikimedia.org/T190940) (owner: 10Ottomata) [14:52:32] yes [14:52:39] nuria_: i do see an increase from ~250 req/s to ~350 req/s since we started on https://grafana-admin.wikimedia.org/dashboard/db/restbase?panelId=15&fullscreen&orgId=1&from=now-3h&to=now&refresh=1m (click on the first line in the list to see just the endpoint used by page previews) [14:52:40] once you get bucketed you're in [14:52:47] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136223 (10ayounsi) asw-d-codfw-ge-3/0/10 now in private1-d-codfw.
Let me know when to disable asw-c6-codfw:ge-6/0/9 [14:52:52] (03PS1) 10Papaul: DNS: Move db2042 fron private1-c-codfw to private1-d-codfw [dns] - 10https://gerrit.wikimedia.org/r/427136 (https://phabricator.wikimedia.org/T191193) [14:52:59] !log Stop MySQL on db2042 to move it to another rack - https://phabricator.wikimedia.org/T191193 [14:53:01] raynor: but you are only bucketed at the beginning of your session [14:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:17] raynor: if you had an existing session the update of percentages does not affect you [14:53:51] I'm not sure, we calculate the bucket based on the session [14:54:19] I'm not sure if we store that bucket, we might calculate it every time (but because sessionId is the same you stay in the same bucket) [14:54:49] (03PS1) 10Pmiazga: Enable PagePreviews for 75% anon users on en wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427137 (https://phabricator.wikimedia.org/T191101) [14:54:50] I'll check that in a moment [14:55:37] (03CR) 10Marostegui: [C: 032] DNS: Move db2042 fron private1-c-codfw to private1-d-codfw [dns] - 10https://gerrit.wikimedia.org/r/427136 (https://phabricator.wikimedia.org/T191193) (owner: 10Papaul) [14:55:49] !log starting data reimport after re-image for wdqs2001 - T189192 [14:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:55] T189192: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192 [14:56:15] raynor: https://github.com/wikimedia/mediawiki-extensions-Popups/blob/master/src/getUserBucket.js [14:56:34] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4136240 (10Papaul) [14:56:43] 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup ms-be1040-1043 - https://phabricator.wikimedia.org/T191896#4136243 (10Cmjohnson) @ayounsi I cabled the new ms-be servers to the following. Please let me know if you want that changed. 1040 a7 xe-7/0/28 1041 b7 xe-7/0/18 1042 c7 xe-7/0/28 1043 d... [14:56:47] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427132 (owner: 10Marostegui) [14:57:17] 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup ms-be1040-1043 - https://phabricator.wikimedia.org/T191896#4136250 (10Cmjohnson) [14:57:29] RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:57:29] raynor: https://github.com/wikimedia/mediawiki-extensions-Popups/blob/master/src/isEnabled.js [14:57:30] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [14:57:45] nuria_, yes, that's the main logic we use [14:57:58] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427132 (owner: 10Marostegui) [14:58:01] (03PS2) 10Filippo Giunchedi: nagios: more understandable output for check_prometheus_metric [puppet] - 10https://gerrit.wikimedia.org/r/427115 (https://phabricator.wikimedia.org/T192343) [14:58:26] and we don't store that off/control/on anywhere [14:58:27] https://github.com/wikimedia/mediawiki-extensions-Popups/blob/master/src/isEnabled.js#L44 [14:58:33] we just check it [14:58:56] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4136265 (10Marostegui) [14:59:19] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427132 (owner: 10Marostegui) [14:59:27] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1114 after changing the network cable - T191996 (duration: 01m 02s) [14:59:28] raynor: we check it against a config setting [14:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:33] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [14:59:40] and the experiments.js also doesn't store that value [14:59:50] RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:59:51] so if we change the config it should recalculate [15:00:04] mobrovac: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Page Previews roll-out to enwiki. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180417T1500). [15:00:04] raynor: A patch you scheduled for Page Previews roll-out to enwiki is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [15:00:08] raynor: we check the bucket against the config setting, but the config setting is not refreshed for all users every time we update it, is it? [15:00:17] it means that every time we adjust the PopupsAnonsExperimentalGroupSize [15:00:37] raynor: every time there is a pageview? mmmmm [15:00:51] (03PS1) 10Alexandros Kosiaris: helm: Add a relationship to HELM_HOME directory [puppet] - 10https://gerrit.wikimedia.org/r/427141 [15:00:53] (03PS1) 10Alexandros Kosiaris: Update scap-helm to force --reuse-values on upgrade [puppet] - 10https://gerrit.wikimedia.org/r/427142 [15:00:54] we check the bucket on every page view - but we use the sessionid [15:00:56] which doesn't change [15:01:10] PROBLEM - Host db2042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:01:19] it means that on each config change we recalculate that and decide whether to show Page Previews for anon users [15:01:21] ^ expected [15:01:25] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4136271 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs3004.esams.wmnet'] ``` and were **ALL** successful.
[15:02:26] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] mathoid: Refresh deployment on config changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/427121 (owner: 10Alexandros Kosiaris) [15:02:32] (03CR) 10Elukey: Puppetize cron job archiving old MaxMind databases (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) (owner: 10Fdans) [15:02:40] that means that someone who had page previews enabled when we set the PopupsAnonsExperimentalGroupSize to 25% might now, after adjusting the config to 50%, get bucketed as a person who doesn't see Page Previews [15:02:44] (03CR) 10Alexandros Kosiaris: [C: 032] helm: Add a relationship to HELM_HOME directory [puppet] - 10https://gerrit.wikimedia.org/r/427141 (owner: 10Alexandros Kosiaris) [15:02:47] (03CR) 10Alexandros Kosiaris: [C: 032] Update scap-helm to force --reuse-values on upgrade [puppet] - 10https://gerrit.wikimedia.org/r/427142 (owner: 10Alexandros Kosiaris) [15:03:30] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:04:41] raynor: config changes take effect immediately [15:04:56] raynor: asking, sorry [15:05:43] (03PS3) 10Filippo Giunchedi: nagios: more understandable output for check_prometheus_metric [puppet] - 10https://gerrit.wikimedia.org/r/427115 (https://phabricator.wikimedia.org/T192343) [15:06:02] nuria_: in short - yes, it's calculated per every pageView, but it uses sessionId which stays the same during the whole session (until you close the tab) [15:06:14] config changes take effect immediately [15:06:24] raynor: sorry, what i was asking was when config changes take effect once deployed [15:06:37] immediately [15:06:53] raynor: then i do not understand why the page previews traffic has not increased [15:07:01] raynor: i think we should look at that for a bit [15:07:11] cc mobrovac [15:07:30] nuria_: sorry, let me correct myself [15:07:38] immediately for new sessions [15:07:44] raynor: ahhh [15:07:47] or wait [15:07:55] sorry, let me think twice [15:08:15] I'll test it to be 100% sure [15:08:16] one sec [15:08:36] (03PS1) 10Ottomata: Fix mirror alert parameter [puppet] - 10https://gerrit.wikimedia.org/r/427144 (https://phabricator.wikimedia.org/T190940) [15:09:02] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4136295 (10Marostegui) [15:09:05] (03PS1) 10Jcrespo: Add prometheus mysql s1 monitoring to dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/427145 (https://phabricator.wikimedia.org/T192358) [15:09:16] (03CR) 10Ottomata: [C: 032] Fix mirror alert parameter [puppet] - 10https://gerrit.wikimedia.org/r/427144 (https://phabricator.wikimedia.org/T190940) (owner: 10Ottomata) [15:10:12] (03CR) 10Volans: [C: 031] "LGTM, nitpick inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427115 (https://phabricator.wikimedia.org/T192343) (owner: 10Filippo Giunchedi) [15:11:10] (03PS2) 10Muehlenhoff: Remove image scaler Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/427133 (https://phabricator.wikimedia.org/T188062) [15:11:24] (03PS1) 10Jcrespo: Update mariadb package to 10.1.34 [software] - 10https://gerrit.wikimedia.org/r/427147 [15:11:26] (03PS1) 10Jcrespo: dbhosts: Add new s1 instance to dbstore1001 [software] - 10https://gerrit.wikimedia.org/r/427148 [15:11:49] RECOVERY - Host db2042.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.57 ms [15:11:50] 10Operations,
10Traffic: Unconditional return(deliver) in vcl_hit - https://phabricator.wikimedia.org/T192368#4136305 (10ema) [15:11:55] (03CR) 10Muehlenhoff: [C: 032] Remove image scaler Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/427133 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [15:12:20] 10Operations, 10Traffic: Unconditional return(deliver) in vcl_hit - https://phabricator.wikimedia.org/T192368#4136317 (10ema) p:05Triage>03Normal [15:12:30] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:12:45] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1067 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427131 (owner: 10Jcrespo) [15:12:49] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1067 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427131 [15:13:43] nuria_: ok, I'm making noise ;/ those changes should be immediate - two things have to happen: we have to deploy the config change and the user has to get the new page [15:13:58] yes, makes sense [15:14:31] when the user loads the page we bucket the user and, based on that, we decide whether to show the popups [15:14:39] from the back-end side, every time we increase the percentage, i see an increase of about 50 uncached reqs for the summary endpoint [15:14:41] the change should be immediate, after deployment + page refresh [15:14:44] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4136339 (10Papaul) [15:15:10] 10Operations, 10Performance-Team, 10Availability (MediaWiki-MultiDC): Deploy mcrouter to production as a wancache backend - https://phabricator.wikimedia.org/T192370#4136341 (10Joe) [15:15:55] (03PS1) 10Vgutierrez: pybal: Reenable BGP on lvs3004 [puppet] - 10https://gerrit.wikimedia.org/r/427149 (https://phabricator.wikimedia.org/T191897) [15:16:56] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar), and 4 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#4136386 (10Joe) [15:17:00] 10Operations, 10MediaWiki-Configuration, 10Availability (MediaWiki-MultiDC), 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), and 4 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#4136384 (10Joe) 05Open>03Resolved a:03Joe [15:17:44] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool es1017 for reimage" but pool it slowly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427128 (owner: 10Jcrespo) [15:17:46] (03PS4) 10Filippo Giunchedi: nagios: more understandable output for check_prometheus_metric [puppet] - 10https://gerrit.wikimedia.org/r/427115 (https://phabricator.wikimedia.org/T192343) [15:17:48] (03PS3) 10Jcrespo: Revert "mariadb: Depool es1017 for reimage" but pool it slowly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427128 [15:17:50] (03CR) 10Filippo Giunchedi: nagios: more understandable output for check_prometheus_metric (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427115 (https://phabricator.wikimedia.org/T192343) (owner: 10Filippo Giunchedi) [15:17:58] olliv: raynor: nuria_: ok to proceed to 75% ?
[15:18:13] mobrovac: let me look at things a bit [15:18:54] (03CR) 10Jcrespo: [C: 032] Update mariadb package to 10.1.34 [software] - 10https://gerrit.wikimedia.org/r/427147 (owner: 10Jcrespo) [15:19:08] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1067 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427131 (owner: 10Jcrespo) [15:19:11] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136423 (10Papaul) switch port information when ready to move db2048. db2048 was on asw-c6-codfw ge-6/0/17 and now will be on asw-a1-codfw ge-1/0/0 new ip address will be : 10... [15:19:12] (03CR) 10Jcrespo: [C: 032] dbhosts: Add new s1 instance to dbstore1001 [software] - 10https://gerrit.wikimedia.org/r/427148 (owner: 10Jcrespo) [15:19:17] imho it's ok [15:19:43] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136425 (10Papaul) [15:21:19] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4096579 (10Papaul) Moved db2042 from c6 to d3 in racktables [15:21:31] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Change db2048 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427150 (https://phabricator.wikimedia.org/T191193) [15:21:56] kk, waiting for nuria_'s green light [15:22:04] yup [15:22:04] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1067, es1017 with low load (duration: 01m 02s) [15:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:38] mobrovac: can someone explain why traffic is much lower than we expect? ccray [15:22:42] mobrovac: can someone explain why traffic is much lower than we expect? cc raynor [15:23:28] i cannot say why virtualpageview increases are so low [15:23:37] !log Stop MySQL on db2048 for rack movement - T191193 [15:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:43] T191193: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193 [15:23:51] !log Stopping mysql on db2048 will break replication on codfw s1 slaves [15:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:58] raynor: can you?
[15:24:00] (03PS1) 10Papaul: DNS: Move db2048 from prvate1-c-odfw to private1-a-codfw [dns] - 10https://gerrit.wikimedia.org/r/427151 (https://phabricator.wikimedia.org/T191193) [15:24:11] nope, I'm thinking about it [15:24:42] (03CR) 10jenkins-bot: Revert "mariadb: Depool es1017 for reimage" but pool it slowly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427128 (owner: 10Jcrespo) [15:25:17] nuria_, mobrovac - we're in the hangout if you want join as well [15:25:25] k [15:25:42] mobrovac: give me couple mins [15:27:40] mobrovac: omw [15:29:27] (03CR) 10Marostegui: [C: 032] DNS: Move db2048 from prvate1-c-odfw to private1-a-codfw [dns] - 10https://gerrit.wikimedia.org/r/427151 (https://phabricator.wikimedia.org/T191193) (owner: 10Papaul) [15:29:51] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Change db2048 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427150 (https://phabricator.wikimedia.org/T191193) (owner: 10Marostegui) [15:30:04] (03PS1) 10Jcrespo: mariadb: Pool es1017 with full weight after upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427152 [15:30:34] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#3878953 (10Vgutierrez) @Cmjohnson how are hostports being identified right now? I mean, how do you know which interface is eth0 and which one is eth3? We are currently upgra... [15:30:52] (03PS1) 10Papaul: DNS: Remove mgmt DNS for db2030 [dns] - 10https://gerrit.wikimedia.org/r/427153 (https://phabricator.wikimedia.org/T187768) [15:31:06] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Change db2048 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427150 (https://phabricator.wikimedia.org/T191193) (owner: 10Marostegui) [15:31:20] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Change db2048 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427150 (https://phabricator.wikimedia.org/T191193) (owner: 10Marostegui) [15:31:41] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4136494 (10Cmjohnson) @ayounsi the new card arrived and is installed...all the fibers are run. I need to know which port you prefer on each switch. I was going to use xe-4... [15:31:44] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4136495 (10Papaul) [15:31:57] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136497 (10ayounsi) >>! In T191193#4136423, @Papaul wrote: > switch port information when ready to move db2048. > db2048 was on asw-c6-codfw ge-6/0/17 and now will be on asw-a1-... 
[15:32:17] (03PS2) 10Marostegui: DNS: Remove mgmt DNS for db2030 [dns] - 10https://gerrit.wikimedia.org/r/427153 (https://phabricator.wikimedia.org/T187768) (owner: 10Papaul) [15:32:42] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Change db2048 IP - T191193 (duration: 00m 58s) [15:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:49] T191193: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193 [15:33:00] (03CR) 10Vgutierrez: [C: 032] pybal: Reenable BGP on lvs3004 [puppet] - 10https://gerrit.wikimedia.org/r/427149 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [15:33:07] (03PS2) 10Vgutierrez: pybal: Reenable BGP on lvs3004 [puppet] - 10https://gerrit.wikimedia.org/r/427149 (https://phabricator.wikimedia.org/T191897) [15:33:30] (03CR) 10Marostegui: [C: 032] DNS: Remove mgmt DNS for db2030 [dns] - 10https://gerrit.wikimedia.org/r/427153 (https://phabricator.wikimedia.org/T187768) (owner: 10Papaul) [15:33:46] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Change db2048 IP - T191193 (duration: 00m 58s) [15:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:07] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324#4136522 (10EddieGP) [15:37:15] PROBLEM - Host db2048.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:37:20] ^expected [15:37:24] !log Repool (Enable BGP) on lvs3004 - T191897 [15:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:31] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [15:39:08] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4136535 (10Vgutierrez) [15:39:20] (03PS1) 10Alexandros Kosiaris: helm: Allow users to fetch charts in HELM_HOME [puppet] - 10https://gerrit.wikimedia.org/r/427155 [15:41:02] (03CR) 10Alexandros Kosiaris: [C: 032] helm: Allow users to fetch charts in HELM_HOME [puppet] - 10https://gerrit.wikimedia.org/r/427155 (owner: 10Alexandros Kosiaris) [15:45:38] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/427115 (https://phabricator.wikimedia.org/T192343) (owner: 10Filippo Giunchedi) [15:46:24] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136548 (10Papaul) [15:46:31] (03PS1) 10Ladsgroup: Limit page creation and edit rate on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427156 (https://phabricator.wikimedia.org/T184948) [15:47:36] 10Operations, 10DBA, 10Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4136553 (10Marostegui) [15:47:44] RECOVERY - Host db2048.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.62 ms [15:49:54] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136557 (10Papaul) a:05Papaul>03Marostegui Moved db2048 from C6 to A1 in racktables @Marostegui assigning the tasks back to you if you think everything looks good you can... 
[15:50:12] !log enable puppet in labstore1004 [15:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:42] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4136561 (10Papaul) a:05Papaul>03RobH @Robh everything done on my side only switch port left. Thanks [15:50:59] jouncebot: refresh [15:51:00] I refreshed my knowledge about deployments. [15:51:03] jouncebot: next [15:51:03] In 0 hour(s) and 8 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180417T1600) [15:51:33] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4136566 (10Marostegui) Adding @ayounsi as @RobH is away. [15:51:35] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active [15:53:17] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4136568 (10Papaul) switch port information db2012 asw-a6-codfw ge-6/0/11 [15:53:25] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:54:22] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136571 (10Marostegui) a:05Marostegui>03ayounsi Thanks @Papaul!! I have talked to @ayounsi and he will clean up the ports and close the task when ready [15:55:09] (03PS4) 10Arturo Borrero Gonzalez: cloudvps: add apt pinning for mitaka on jessie [puppet] - 10https://gerrit.wikimedia.org/r/426891 (https://phabricator.wikimedia.org/T192162) [15:56:36] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4136590 (10Marostegui) Adding @ayounsi as @RobH is away [15:58:35] * mobrovac taking over tin [15:58:44] (03PS2) 10Mobrovac: Enable PagePreviews for 75% anon users on en wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427137 (https://phabricator.wikimedia.org/T191101) (owner: 10Pmiazga) [15:59:43] 10Operations, 10ops-eqiad, 10Traffic: sda failure in hydrogen.wikimedia.org - https://phabricator.wikimedia.org/T192280#4136603 (10Cmjohnson) This servers warranty expired in 2014 and should be replaced instead of repaired. @faidon please comment. [15:59:44] sql [16:00:04] godog, moritzm, and _joe_: #bothumor I � Unicode. All rise for Puppet SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180417T1600). [16:00:04] eddiegp: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:35] please wait 5 mins, i need to get something out [16:00:42] (03PS1) 10Ottomata: Install R on Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/427159 (https://phabricator.wikimedia.org/T192348) [16:00:47] (03CR) 10Mobrovac: [V: 032 C: 032] Enable PagePreviews for 75% anon users on en wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427137 (https://phabricator.wikimedia.org/T191101) (owner: 10Pmiazga) [16:01:07] (03CR) 10jenkins-bot: Enable PagePreviews for 75% anon users on en wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427137 (https://phabricator.wikimedia.org/T191101) (owner: 10Pmiazga) [16:01:08] (03PS1) 10Gehel: wdqs: tune performance limits for the new wdqs-internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/427160 (https://phabricator.wikimedia.org/T187766) [16:01:18] (03CR) 10jerkins-bot: [V: 04-1] Install R on Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/427159 (https://phabricator.wikimedia.org/T192348) (owner: 10Ottomata) [16:01:40] are the {domain}/v1/feed/announcements warnings expected? [16:02:10] (03PS2) 10Ottomata: Install R on Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/427159 (https://phabricator.wikimedia.org/T192348) [16:02:16] from services/movileapps hosts? [16:02:20] (03PS3) 10Ottomata: Install R on Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/427159 (https://phabricator.wikimedia.org/T192348) [16:02:21] they are known to happen jynus [16:02:22] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Enable page previews for 75% of anons for enwiki - T191101 (duration: 00m 58s) [16:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:28] T191101: Let's do this: Rollout page previews to 100% of anons on English Wikipedia - https://phabricator.wikimedia.org/T191101 [16:02:30] ok i'm done [16:02:35] ok, thanks [16:02:40] raynor: olliv: nuria_: we are on 75% [16:02:43] mobrovac, thx [16:02:56] (03CR) 10jerkins-bot: [V: 04-1] Install R on Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/427159 (https://phabricator.wikimedia.org/T192348) (owner: 10Ottomata) [16:03:55] mobrovac: k [16:04:15] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10955/" [puppet] - 10https://gerrit.wikimedia.org/r/427159 (https://phabricator.wikimedia.org/T192348) (owner: 10Ottomata) [16:04:42] I'm here :) [16:05:22] (03PS4) 10Ottomata: Install R on Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/427159 (https://phabricator.wikimedia.org/T192348) [16:05:57] 10Operations, 10ops-eqiad, 10Traffic: sda failure in hydrogen.wikimedia.org - https://phabricator.wikimedia.org/T192280#4136626 (10faidon) Yup, a replacement is underway as part of T189317 :) [16:07:33] mobrovac, nuria_ so I was checking the cache [16:07:42] and the JS file is cached for 5 minutes [16:08:06] I was checking the WMF config after mobrovac deployed PagePreviews to the 75% [16:08:55] I fetched the config at 6:02pm, then it stayed the same till 6:07pm (stored on disk) [16:09:17] and on 6:07 cache expired and it fetched the new file from the server with new config [16:09:44] and the file it fetched contained the new config setting? 
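The observation just above (the JS file carrying the config is cached client-side for about five minutes, so a freshly deployed value only reaches a browser once its cached copy expires) implies a gradual ramp-up rather than an instant jump in traffic. A toy Python sketch of that effect, with invented numbers and the simplifying assumption that clients reload pages frequently:

```python
# Toy model: each client last fetched the config-bearing JS file at a uniformly
# random moment in the 300 s before the deploy, so its cached copy expires
# somewhere in the following 300 s. Illustrative only, not measured behaviour.
import random

CACHE_TTL = 300  # seconds; the ~5 minute client-side cache mentioned above

def fraction_on_new_config(seconds_after_deploy: float, clients: int = 100_000) -> float:
    last_fetch = [random.uniform(-CACHE_TTL, 0) for _ in range(clients)]
    refreshed = sum(1 for t in last_fetch if t + CACHE_TTL <= seconds_after_deploy)
    return refreshed / clients

for t in (0, 60, 150, 300):
    print(f"{t:>3}s after deploy: ~{fraction_on_new_config(t):.0%} of active clients on the new config")
```

Under that assumption the new value only reaches everyone roughly five minutes after the sync, which matches the 6:02pm deploy / 6:07pm refresh described above.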
[16:09:47] it means there is a 5 minutes delay between deployment [16:09:56] (03CR) 10Ottomata: [C: 032] Install R on Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/427159 (https://phabricator.wikimedia.org/T192348) (owner: 10Ottomata) [16:09:57] mobrovac, yes - thats correct, the file contained the new config [16:10:01] (03PS1) 10Elukey: profile::zookeeper::monitoring: improve jmx exporter's config [puppet] - 10https://gerrit.wikimedia.org/r/427162 [16:10:47] that still doesn't explain the low amplitude we are seeing [16:11:40] Anyone able & willing to do puppet swat? :) [16:13:11] eddiegp: yeah I'll take a look [16:13:21] godog: Thanks! [16:14:09] PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:14:19] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:14:19] PROBLEM - puppet last run on analytics1068 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:14:19] PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:14:19] PROBLEM - puppet last run on analytics1057 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:14:20] PROBLEM - puppet last run on analytics1069 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:14:20] PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:14:20] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:14:21] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:14:29] PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:14:29] PROBLEM - puppet last run on analytics1044 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:14:29] PROBLEM - puppet last run on analytics1062 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:14:32] Hello R! 
[16:14:39] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:14:39] PROBLEM - puppet last run on analytics1045 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:14:39] PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:14:39] PROBLEM - puppet last run on analytics1060 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:14:49] PROBLEM - puppet last run on analytics1035 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:14:49] PROBLEM - puppet last run on analytics1066 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:14:52] ottomata: --^ [16:15:09] PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:15:18] (03CR) 10Anomie: wiki replicas: Add new MCR tables to views (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/426325 (https://phabricator.wikimedia.org/T184446) (owner: 10Bstorm) [16:15:24] elukey: ah!ck k [16:15:26] on it... [16:15:29] PROBLEM - puppet last run on analytics1052 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 37 seconds ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:15:30] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:15:30] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:15:30] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:15:30] weird, i PCCed it [16:16:17] heh I think that's at the edge between what PCC can do and an universe simulator would be able to effectively test [16:16:19] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:16:19] PROBLEM - puppet last run on analytics1033 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:16:26] nuria_, question [16:16:31] just guessing though, I don't know the specific error [16:16:40] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:16:44] raynor: on meeting go ahead i might respond later [16:16:49] PROBLEM - puppet last run on analytics1058 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:16:53] we don't see a peak in the graph, but if you look at yesterday or the day before yesterday the graph at same time was going down [16:17:09] PCC is incredibly useful, but it doesn't amount to a proof a puppet patch will work (there are known holes it can't see into where compilation can fail) [16:17:09] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:17:09] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:17:31] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 12 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4136660 (10Fjalapeno) [16:17:34] (03PS1) 10Arturo Borrero Gonzalez: labstore[12]00[12]: apply spare system role [puppet] - 10https://gerrit.wikimedia.org/r/427163 [16:17:39] PROBLEM - puppet last run on analytics1064 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:17:42] now we're in kinda flat graph, looks like european folks stop using the pagepreviews (so the graph should go down) but US people start the day and they balance the graph so it says flat [16:17:49] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:17:49] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:17:49] PROBLEM - puppet last run on analytics1059 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:18:09] PROBLEM - puppet last run on analytics1063 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:18:09] PROBLEM - puppet last run on analytics1042 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:18:10] PROBLEM - puppet last run on analytics1049 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:18:10] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:18:10] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:18:10] PROBLEM - puppet last run on analytics1037 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:18:10] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:18:11] because of that we don't see the peak, there is different time zone, and european peak happens on different time than the US one [16:18:14] eddiegp: so https://gerrit.wikimedia.org/r/c/425967/ will require consensus, affecting production [16:18:19] PROBLEM - puppet last run on analytics1046 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:18:29] PROBLEM - puppet last run on analytics1067 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[r-base],Package[r-base-dev],Package[r-recommended] [16:18:53] eddiegp: meaning someone to +1 it that's involved in what the cron does, not really for puppet swat [16:19:21] (03CR) 10Muehlenhoff: labstore[12]00[12]: apply spare system role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427163 (owner: 10Arturo Borrero Gonzalez) [16:19:52] godog: I think there was a discussion on the task, but sure, I can ask to get a +1 as well. [16:20:14] (the child task, that is). [16:20:37] eddiegp: ok, thanks! I'll look at the other patches [16:20:46] (03PS2) 10Arturo Borrero Gonzalez: labstore[12]00[12]: apply spare system role [puppet] - 10https://gerrit.wikimedia.org/r/427163 [16:20:58] godog: The first is easy, it's a no-op in prod and already cherry-picked to beta. [16:22:02] godog: The third I don't know ... let's at least run the puppet compiler on it and let it confirm it's a no-op in prod. But totally up to you (apache conf). [16:22:49] 10Operations, 10Cassandra, 10hardware-requests, 10Services (blocked), 10User-Eevans: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4054767 (10Cmjohnson) @mobrovac The 5 ssds arrived for restbase1010. Do you need to schedule down time to replace? [16:22:51] (03CR) 10Muehlenhoff: [C: 031] labstore[12]00[12]: apply spare system role [puppet] - 10https://gerrit.wikimedia.org/r/427163 (owner: 10Arturo Borrero Gonzalez) [16:22:54] eddiegp: ack, I'll start from the first, I'll let you know when done so you can remove it from beta puppetmaster cherry picks [16:23:10] (03PS6) 10Filippo Giunchedi: beta: Combine commons, deployments, meta and zero vhost (3) [puppet] - 10https://gerrit.wikimedia.org/r/426104 (owner: 10EddieGP) [16:23:12] eddiegp: hi! Don't have a ton of context but what changed from last time? 
IIRC it was cherry picked in beta too, but then we deployed and beta was broken for a bit (redirects but I might be wrong) [16:23:14] (03CR) 10Arturo Borrero Gonzalez: [C: 032] labstore[12]00[12]: apply spare system role [puppet] - 10https://gerrit.wikimedia.org/r/427163 (owner: 10Arturo Borrero Gonzalez) [16:23:17] 10Operations, 10ops-eqiad, 10Cassandra, 10hardware-requests, and 2 others: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4136672 (10Cmjohnson) [16:23:34] elukey: The difference is that it wasn't cherry-picked to beta last time. [16:23:42] ack [16:23:43] And, obviously, that I amended the patch in the meantime [16:23:46] :) [16:23:56] Else it'd break again [16:24:22] (03CR) 10Filippo Giunchedi: [C: 032] beta: Combine commons, deployments, meta and zero vhost (3) [puppet] - 10https://gerrit.wikimedia.org/r/426104 (owner: 10EddieGP) [16:24:36] (03PS7) 10Filippo Giunchedi: beta: Combine commons, deployments, meta and zero vhost (3) [puppet] - 10https://gerrit.wikimedia.org/r/426104 (owner: 10EddieGP) [16:28:12] eddiegp: WRT puppet compiler, you should be able to run it yourself too via utils/pcc in puppet.git [16:28:24] I'll run it on https://gerrit.wikimedia.org/r/c/424371/ [16:28:43] Huh, you can run it locally? I didn't know that. [16:29:23] eddiegp: no, the script reaches out to jenkins [16:30:13] So how would I run it then if it needs to access jenkins? [16:31:07] The only way to run it that I knew of until now is the web form directly in jenkins. I don't have permission to use that. [16:31:09] raynor: these are pageviews for en.wikipedia, they come from all over the globe , but en.wikipedia is mostly us: https://tinyurl.com/yb5gqxfq, times are UTc [16:31:38] 10Operations, 10ops-eqiad, 10Cassandra, 10hardware-requests, and 2 others: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4136684 (10Eevans) >>! In T189822#4136670, @Cmjohnson wrote: > @mobrovac The 5 ssds arrived for restbase1010. Do you need to sc... [16:31:52] raynor: peaks are at 4 utc so about now [16:32:21] nuria_, we found the issue [16:32:42] eddiegp: heh I don't remember what's needed to get a jenkins api token, I was reading the instructions here https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Local_run_(pcc_utility) [16:32:42] raynor: aham... [16:32:48] (03PS2) 10Jcrespo: Add prometheus mysql s1 monitoring to dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/427145 (https://phabricator.wikimedia.org/T192358) [16:32:54] raynor: oh, what's up? [16:33:09] * nuria_ listening [16:33:28] so we deployed to 37.5 [16:33:34] (03CR) 10Jcrespo: [C: 032] Add prometheus mysql s1 monitoring to dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/427145 (https://phabricator.wikimedia.org/T192358) (owner: 10Jcrespo) [16:33:47] not 75 - we're using the testing infrastructure for A/B tests, we forgot we have a control group [16:34:05] 10Operations, 10ops-eqiad, 10Cassandra, 10hardware-requests, and 2 others: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4136698 (10Eevans) >>! In T189822#4136684, @Eevans wrote: >>>! In T189822#4136670, @Cmjohnson wrote: >> @mobrovac The 5 ssds ar... [16:34:10] (03PS2) 10Elukey: profile::zookeeper::monitoring: improve jmx exporter's config [puppet] - 10https://gerrit.wikimedia.org/r/427162 [16:34:10] so we can't really get to 100% raynor? 
[16:34:47] we can [16:34:51] we can go to 100% [16:34:53] !log decommissioning Cassandra, restbase1010-a -- T189822 [16:34:57] but we cannot go to more than 50% [16:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:59] T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822 [16:35:03] because if we enable to X [16:35:14] yes, exactly what i said, we cannot get to 100% [16:35:16] (03CR) 10Elukey: [C: 032] profile::zookeeper::monitoring: improve jmx exporter's config [puppet] - 10https://gerrit.wikimedia.org/r/427162 (owner: 10Elukey) [16:35:27] we can do 100% [16:35:46] if we pass `0` as the group size => it means we deploy to everyone [16:35:48] eddiegp: another question, is there a task for related work to https://gerrit.wikimedia.org/r/c/424371/ ? [16:35:59] i see [16:36:02] shit [16:36:03] but if the groupsize is different than 0 it means that we run an experiment [16:36:04] ok [16:36:11] and we have three groups (on, off, control) [16:36:17] not two (on/off) [16:36:22] so 0.5 means 0.5 / 3 [16:36:51] godog: It's basically cleanup after https://gerrit.wikimedia.org/r/c/424361 which was for T173887 [16:36:52] T173887: Wikimedia.org portal broken in Beta Cluster (Domain unavailable) - https://phabricator.wikimedia.org/T173887 [16:37:03] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4136714 (10ayounsi) [16:37:04] nuria_, mobrovac - do you have time to join us in 25 min? it might be easier to make a decision that way. It looks like our best option would be to go from 37 to 100, but that is a big jump [16:37:23] But no task specifically for that removal. [16:37:27] was about to suggest a meeting olliv [16:37:36] olliv: i will be able to join in ~30, 35 mins [16:37:48] !log incremental rollout of the new zookeeper jmx config to druid1* and conf* [16:37:49] mobrovac: great, thank you [16:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:54] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#4136726 (10ayounsi) [16:38:01] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db2030 - https://phabricator.wikimedia.org/T187768#3984964 (10ayounsi) 05Open>03Resolved Switch port cleaned up. [16:38:55] eddiegp: ack, thanks [16:39:04] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/10956/" [puppet] - 10https://gerrit.wikimedia.org/r/424371 (owner: 10EddieGP) [16:39:29] mobrovac: 0.5 means -> 0.5off + 0.25control + 0.25on [16:39:44] control is off [16:39:50] right raynor, so 0.5 / 3 [16:39:59] oh even less than that [16:40:01] 0.5/2 [16:40:12] euh ok [16:40:22] we're at 75% => it means 37.5% [16:40:41] are on, rest is off, we can go up to 50%, nothing more [16:40:43] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4136735 (10ayounsi) Switch port cleaned up. 
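The arithmetic being worked out above can be written down compactly. A Python sketch, assuming the semantics described in the chat (group size 0 means "deploy to everyone"; otherwise the configured fraction is split evenly between "on" and "control", and control behaves like off):

```python
def effective_fractions(group_size: float) -> dict:
    """Fractions of anonymous users in each state, per the scheme described above.

    group_size == 0 is the special 'deploy to everyone' case; otherwise the
    configured fraction is split evenly between 'on' and 'control', and
    control users behave exactly like 'off' users.
    """
    if group_size == 0:
        return {"on": 1.0, "control": 0.0, "off": 0.0}
    return {
        "on": group_size / 2,
        "control": group_size / 2,
        "off": 1.0 - group_size,
    }

# 0.75 -> 37.5% actually seeing previews, which is what was deployed above;
# the maximum reachable without the group_size == 0 escape hatch is 50%.
print(effective_fractions(0.75))  # {'on': 0.375, 'control': 0.375, 'off': 0.25}
print(effective_fractions(1.0))   # {'on': 0.5, 'control': 0.5, 'off': 0.0}
print(effective_fractions(0))     # {'on': 1.0, 'control': 0.0, 'off': 0.0}
```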
[16:40:48] because the other half has to be control group [16:40:56] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4136746 (10ayounsi) [16:40:58] (03CR) 10Filippo Giunchedi: [C: 032] mediawiki: Remove now unused parameter portal_dir [puppet] - 10https://gerrit.wikimedia.org/r/424371 (owner: 10EddieGP) [16:41:20] (03PS2) 10Filippo Giunchedi: mediawiki: Remove now unused parameter portal_dir [puppet] - 10https://gerrit.wikimedia.org/r/424371 (owner: 10EddieGP) [16:41:26] yes, it's mistake made by us [16:43:35] godog: I justed tested that section from wikitech btw. I can get an api token, but PCC won't work via the api without advanced permissions either. [16:45:05] eddiegp: ack, thanks for testing! I'm not sure what's up with that tho [16:45:17] I merged the portals_dir patch btw [16:45:39] Yeah, I saw that, thanks. :) [16:46:08] np [16:46:59] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))): 17805630.086419746 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:47:15] Well it says I'm lacking the 'Job/Build' permission. Makes sense, that permission probably allows to build *any* job in jenkins. [16:47:55] (03CR) 10GoranSMilovanovic: [C: 031] Install R on Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/427159 (https://phabricator.wikimedia.org/T192348) (owner: 10Ottomata) [16:47:59] RECOVERY - Request latencies on argon is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:49:20] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4136752 (10Marostegui) Thanks @ayounsi, so only pending the mgmt dns entries! [16:49:39] (03PS1) 10Ottomata: Don't need to pin backports for R on jessie anymore [puppet] - 10https://gerrit.wikimedia.org/r/427170 (https://phabricator.wikimedia.org/T192348) [16:50:10] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to add Tarrow to the ldap/wmde group - https://phabricator.wikimedia.org/T192060#4136755 (10RStallman-legalteam) Thomas Arrow's NDA is fully signed and on file with legal. Thanks! [16:50:23] (03CR) 10Ottomata: [C: 032] Don't need to pin backports for R on jessie anymore [puppet] - 10https://gerrit.wikimedia.org/r/427170 (https://phabricator.wikimedia.org/T192348) (owner: 10Ottomata) [16:50:27] (03PS2) 10Ottomata: Don't need to pin backports for R on jessie anymore [puppet] - 10https://gerrit.wikimedia.org/r/427170 (https://phabricator.wikimedia.org/T192348) [16:50:29] (03CR) 10Ottomata: [V: 032 C: 032] Don't need to pin backports for R on jessie anymore [puppet] - 10https://gerrit.wikimedia.org/r/427170 (https://phabricator.wikimedia.org/T192348) (owner: 10Ottomata) [16:51:43] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to add Matthias Geisler to the ldap/wmde group - https://phabricator.wikimedia.org/T191523#4136757 (10RStallman-legalteam) @Matthias_Geisler_WMDE - just a ping on this. 
Looking for your email address. Thanks! [16:51:59] (03PS2) 10Jcrespo: mariadb: Pool es1017 with full weight after upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427152 [16:52:40] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [16:52:59] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))): 90801683.12232035 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:53:27] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4136764 (10ayounsi) 05Open>03Resolved asw-a1-codfw ge-1/0/0 cleaned up asw-c6-codfw ge-6/0/9 cleaned up [16:55:05] (03CR) 10Jcrespo: [C: 032] mariadb: Pool es1017 with full weight after upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427152 (owner: 10Jcrespo) [16:55:16] elukey: ahhh R still broken, had to make a phone call, and lunch, and now meetinsg [16:55:19] will fix though [16:55:59] RECOVERY - Request latencies on argon is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:56:09] PROBLEM - Request latencies on acrab is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.16.26:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.16.26:6443}[5m]))): 21059423.899122808 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:56:39] (03PS1) 10Cmjohnson: Removing mgmt dns fluorine [dns] - 10https://gerrit.wikimedia.org/r/427173 (https://phabricator.wikimedia.org/T159996) [16:57:09] RECOVERY - Request latencies on acrab is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.16.26:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.16.26:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:57:14] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns fluorine [dns] - 10https://gerrit.wikimedia.org/r/427173 (https://phabricator.wikimedia.org/T159996) (owner: 10Cmjohnson) [16:57:14] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1017 fully (duration: 01m 16s) [16:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:32] ottomata: ack! [16:57:45] I am fixing zookeeper monitoring [16:59:04] (btw elukey we hsould upgrade all workers to stretch! having package version problems!) 
[16:59:22] +1 [16:59:23] (03CR) 10jenkins-bot: mariadb: Pool es1017 with full weight after upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427152 (owner: 10Jcrespo) [16:59:44] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: decom fluorine - https://phabricator.wikimedia.org/T159996#4136785 (10Cmjohnson) [16:59:56] 10Operations, 10Patch-For-Review, 10User-fgiunchedi: replace fluorine with mwlog servers (was: Upgrade fluorine to trusty/jessie) - https://phabricator.wikimedia.org/T123728#4136787 (10Cmjohnson) [17:00:04] mobrovac: It is that lovely time of the day again! You are hereby commanded to deploy Page Previews roll-out to enwiki. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180417T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:07] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: decom fluorine - https://phabricator.wikimedia.org/T159996#3085886 (10Cmjohnson) 05Open>03Resolved [17:00:10] mobrovac: joining hangout [17:05:09] (03PS1) 10DCausse: [cirrus] Increase the number of shards for wikidatawiki_content, enwiki_general [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427176 (https://phabricator.wikimedia.org/T192064) [17:11:28] (03PS1) 10Pmiazga: Enable PagePreviews to all anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427178 (https://phabricator.wikimedia.org/T191101) [17:14:56] (03CR) 10Mobrovac: [V: 032 C: 032] Enable PagePreviews to all anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427178 (https://phabricator.wikimedia.org/T191101) (owner: 10Pmiazga) [17:15:02] * mobrovac taking tin over [17:16:49] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Enable page previews for 100% of anons for enwiki - T191101 (duration: 00m 59s) [17:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:57] T191101: Let's do this: Rollout page previews to 100% of anons on English Wikipedia - https://phabricator.wikimedia.org/T191101 [17:17:33] * mobrovac is done [17:18:23] thx mobrovac - that was hell of a ride [17:18:52] hehe indeed [17:19:34] (03CR) 10jenkins-bot: Enable PagePreviews to all anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427178 (https://phabricator.wikimedia.org/T191101) (owner: 10Pmiazga) [17:22:25] ok, we see the first signs of traffic increase on the restbase side olliv, raynor, nuria_ [17:23:00] mobrovac: ok, now things are getting interesting [17:23:12] yup yup [17:24:11] mobrovac: where is best graph on your end to look at cassandra requests? [17:24:57] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to add Matthias Geisler to the ldap/wmde group - https://phabricator.wikimedia.org/T191523#4136853 (10Dzahn) 05Open>03stalled p:05Triage>03Normal a:03Matthias_Geisler_WMDE [17:25:23] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to add Tarrow to the ldap/wmde group - https://phabricator.wikimedia.org/T192060#4136859 (10Dzahn) a:03Dzahn [17:25:41] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to add Tarrow to the ldap/wmde group - https://phabricator.wikimedia.org/T192060#4126243 (10Dzahn) p:05Triage>03Normal [17:27:02] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to add Tarrow to the ldap/wmde group - https://phabricator.wikimedia.org/T192060#4126243 (10Dzahn) "tarrow" has been added to the "wmde" LDAP group. 
done [17:27:14] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to add Tarrow to the ldap/wmde group - https://phabricator.wikimedia.org/T192060#4136877 (10Dzahn) [17:27:42] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to add Tarrow to the ldap/wmde group - https://phabricator.wikimedia.org/T192060#4126243 (10Dzahn) 05Open>03Resolved @Tarrow All done. Let us know if any unexpected issues. [17:27:43] nuria_: p99 read latencies for enwiki - https://grafana-admin.wikimedia.org/dashboard/db/cassandra?panelId=16&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fservices&var-cluster=restbase&var-keyspace=enwiki_T_page__summary&var-table=data&var-quantile=99p [17:28:27] 10Operations, 10Ops-Access-Requests: Requesting access to stats machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T190415#4136886 (10Dzahn) 05Open>03stalled [17:28:58] 10Operations, 10Ops-Access-Requests: Requesting access to stats machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T190415#4072956 (10Dzahn) Setting to Stalled, unless that meeting has already happened. In that case, please let us know the status here on ticket. [17:29:45] nuria_, we have some strange drop in VirtualPageView schema, do you see that? [17:30:14] i see a drop in restbase reqs too [17:32:10] drop confirmed in EL raynor [17:33:18] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to shell (snapshot, dumpsdata) for springle - https://phabricator.wikimedia.org/T191478#4136901 (10Dzahn) 05Open>03stalled [17:34:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: decommission elastic1021 - https://phabricator.wikimedia.org/T189727#4136902 (10Gehel) Removing search backend team from this ticket, nothing left to do on our side. [17:35:19] reqs rising again on the rb side [17:35:38] yup, looks like user behavior [17:35:57] VirtualPageViews are back to ~900-ish requests per second [17:40:14] 10Operations, 10Ops-Access-Requests, 10Discovery-Search (Current work): Google Search Console access for Search Platform team - https://phabricator.wikimedia.org/T188453#4136946 (10Dzahn) 05Open>03stalled [17:43:04] mobrovac: The p99 chart is ineresting - After rises, now it's back to low levels [17:43:59] yeah joal, because we have some machines that really perform poorly from time to time, but luckily we have 3 replicas for each piece of content, so clients rarely experience that [17:44:02] so far, that is ... [17:44:27] k mobrovac - I was wondering if it could have been related to caching in some ways [17:44:48] 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4136977 (10CCogdill_WMF) Bumping this! We are doing a series of newsletter tests with Chapters this quarter and it is really important for us to have ac... 
[17:46:02] joal: no, it's related to poor SSDs :/ [17:46:17] I hear that , thanks for clarification mobrovac :) [17:47:45] (03PS1) 10Papaul: DNS: Remove mgmt DNS for db2012 [dns] - 10https://gerrit.wikimedia.org/r/427179 (https://phabricator.wikimedia.org/T187543) [17:48:34] (03CR) 10EBernhardson: [C: 031] [cirrus] Increase the number of shards for wikidatawiki_content, enwiki_general [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427176 (https://phabricator.wikimedia.org/T192064) (owner: 10DCausse) [17:49:04] 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4137007 (10Dzahn) 05Open>03stalled [17:50:31] RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:52:39] (03PS1) 10Subramanya Sastry: Enable RemexHtml on frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427181 (https://phabricator.wikimedia.org/T192301) [17:52:41] (03PS1) 10Subramanya Sastry: Enable RemexHtml on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427182 (https://phabricator.wikimedia.org/T192386) [17:55:13] (03PS1) 10Chad: group0 to wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427183 [17:55:25] raynor, mobrovac out of meeting looking at graphs again [17:55:47] raynor: did joaquin updated ticket? [17:55:54] raynor: if so can you send link? [17:56:01] !log demon@tin Started scap: bootstrap wmf.30 [17:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:23] nuria_: https://phabricator.wikimedia.org/T191101#4136834 [17:57:50] olliv: ok, updated [18:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180417T1800) [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:32:53] 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup ms-be1040-1043 - https://phabricator.wikimedia.org/T191896#4137289 (10ayounsi) >>! In T191896#4136243, @Cmjohnson wrote: > @ayounsi I cabled the new ms-be servers to the following. Please let me know if you want that changed. > > 1040 a7 xe-7/0... [18:41:19] !log rebooting restbase-dev1004 (kernel oom killer misbehaving) [18:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:51] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [18:45:40] !log rebooting restbase-dev1005 (kernel oom killer misbehaving) [18:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:50] (03PS5) 10Andrew Bogott: cloudvps: add apt pinning for mitaka on jessie [puppet] - 10https://gerrit.wikimedia.org/r/426891 (https://phabricator.wikimedia.org/T192162) (owner: 10Arturo Borrero Gonzalez) [18:46:19] (03CR) 10jerkins-bot: [V: 04-1] cloudvps: add apt pinning for mitaka on jessie [puppet] - 10https://gerrit.wikimedia.org/r/426891 (https://phabricator.wikimedia.org/T192162) (owner: 10Arturo Borrero Gonzalez) [18:47:40] (03PS1) 10Chad: mwdeploy: Ensure home directory exists on all machines [puppet] - 10https://gerrit.wikimedia.org/r/427188 [18:49:02] nuria_: raynor: olliv: all looking good from my side, you? 
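On the "occasionally slow machines, but three replicas" point above: because each piece of content lives on several replicas, a single slow node rarely shows up in client-observed latency. A toy Monte Carlo sketch of that intuition in Python — not RESTBase/Cassandra's actual read path, and the latency numbers are invented:

```python
# Toy model: 3 replicas, one read path that can be satisfied by whichever of
# the contacted replicas answers first, and a small chance of a pathological
# stall on any given replica (e.g. a bad SSD).
import random

def replica_latency() -> float:
    # ~2% of reads hit a 200 ms stall, otherwise ~2-8 ms.
    return 200.0 if random.random() < 0.02 else random.uniform(2.0, 8.0)

def percentile(values, p):
    values = sorted(values)
    return values[int(p / 100 * (len(values) - 1))]

single = [replica_latency() for _ in range(50_000)]
fastest_of_three = [min(replica_latency() for _ in range(3)) for _ in range(50_000)]

print("p99, single replica      :", round(percentile(single, 99), 1), "ms")
print("p99, fastest of 3 replicas:", round(percentile(fastest_of_three, 99), 1), "ms")
```

The per-replica p99 is dominated by the stalls, while the client-side p99 stays low because all three replicas would have to stall on the same read, which is what the chart discussion above is describing.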
[18:49:14] same here, all looks good [18:49:31] RECOVERY - Check systemd state on restbase-dev1005 is OK: OK - running: The system is fully operational [18:50:08] mobrovac: yes, bursty traffic but makes sense [18:50:58] 10Operations, 10Mail: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4137331 (10herron) `role::mail::mx` applies on Stretch for the most part, but not without some issues. First off, AFAICT `role::mail:mx` cannot be applied using horizon because node default contains `require :... [18:52:21] !log rebooting restbase-dev1006 (kernel oom killer misbehaving) [18:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:01] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [18:53:11] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [18:55:20] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational [18:55:51] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [18:59:33] (03PS1) 10Aaron Schulz: Point mc-labs.php to mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427189 [18:59:53] raynor: olliv: nuria_: ok, i'm out for the day, i think we are in a good spot now, if something comes up, feel free to call me [19:00:04] thcipriani: That opportune time is upon us again. Time for a MediaWiki train deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180417T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:01:41] !log imarlier@tin Started deploy [performance/navtiming@22483a4]: Navtiming refactor for increased testability, and to add wrapper for easy service use [19:01:43] !log imarlier@tin Finished deploy [performance/navtiming@22483a4]: Navtiming refactor for increased testability, and to add wrapper for easy service use (duration: 00m 02s) [19:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:44] (03PS1) 10Imarlier: navtiming.py: now deployed via scap [puppet] - 10https://gerrit.wikimedia.org/r/427192 (https://phabricator.wikimedia.org/T191994) [19:14:21] PROBLEM - Request latencies on acrab is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.16.26:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.16.26:6443}[5m]))): 20997211.778425656 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:15:21] RECOVERY - Request latencies on acrab is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.16.26:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.16.26:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:15:26] 10Operations, 10Ops-Access-Requests, 10Gerrit, 10Release-Engineering-Team: Requesting access to deployment for pmiazga - https://phabricator.wikimedia.org/T192159#4137418 (10demon) >>! In T192159#4130162, @Dzahn wrote: > I _think_ it's that you are missing here: > > https://gerrit.wikimedia.org/r/#/admin/... 
[19:20:33] (03PS1) 10Herron: profile: update bash auto-logout timeout to 5 days [puppet] - 10https://gerrit.wikimedia.org/r/427193 (https://phabricator.wikimedia.org/T122922) [19:22:11] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))): 21560689.778443117 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:23:11] RECOVERY - Request latencies on chlorine is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:26:33] (03CR) 10Herron: [C: 032] profile: update bash auto-logout timeout to 5 days [puppet] - 10https://gerrit.wikimedia.org/r/427193 (https://phabricator.wikimedia.org/T122922) (owner: 10Herron) [19:29:54] (03PS1) 10Herron: icinga: extend screen/tmux warning time from 4 days to 10 days [puppet] - 10https://gerrit.wikimedia.org/r/427195 [19:41:06] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 50.13, 31.60, 25.52 [19:41:26] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 64.44, 39.98, 32.16 [19:41:37] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 54.61, 34.06, 25.35 [19:41:47] PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 48.17, 32.21, 24.72 [19:41:56] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 52.64, 26.85, 18.32 [19:42:26] PROBLEM - High CPU load on API appserver on mw1287 is CRITICAL: CRITICAL - load average: 61.64, 28.85, 19.53 [19:42:37] PROBLEM - High CPU load on API appserver on mw1279 is CRITICAL: CRITICAL - load average: 61.68, 39.59, 34.38 [19:42:56] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 30.29, 25.31, 18.34 [19:43:26] RECOVERY - High CPU load on API appserver on mw1287 is OK: OK - load average: 30.81, 26.00, 19.14 [19:47:46] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))): 18165835.13745271 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:47:46] PROBLEM - Request latencies on acrab is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.16.26:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.16.26:6443}[5m]))): 21595239.263868067 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:48:36] !log demon@tin Finished scap: bootstrap wmf.30 (duration: 112m 35s) [19:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:44] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4137521 
(10ayounsi) @Cmjohnson: would that work for you? |lvs1016|eth0/eno1|asw2-d:xe-7/0/15|cable #4061| |lvs1016|eth1/eno2|asw2-a:xe-4/0/7 |cable #3917| |lvs1016|eth2/ens... [19:49:46] RECOVERY - Request latencies on argon is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:49:46] RECOVERY - Request latencies on acrab is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.16.26:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.192.16.26:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:51:56] RECOVERY - High CPU load on API appserver on mw1235 is OK: OK - load average: 22.16, 24.37, 23.98 [19:53:58] (03PS1) 10Andrew Bogott: nfs-exportd: add some exception handling to do_fix_and_export [puppet] - 10https://gerrit.wikimedia.org/r/427199 [19:54:01] (03CR) 10Smalyshev: [C: 031] wdqs: tune performance limits for the new wdqs-internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/427160 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [19:54:55] (03CR) 10Andrew Bogott: [C: 032] nfs-exportd: add some exception handling to do_fix_and_export [puppet] - 10https://gerrit.wikimedia.org/r/427199 (owner: 10Andrew Bogott) [19:55:06] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 19.65, 21.89, 23.96 [19:56:46] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 17.69, 21.67, 23.95 [19:57:26] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 32.58, 32.11, 32.19 [19:58:31] (03PS1) 10Ladsgroup: mediawiki: Add clearTermSqlIndexSearchFields for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/427202 (https://phabricator.wikimedia.org/T189779) [19:59:39] !log smalyshev@tin Started deploy [wdqs/wdqs@f08fbcc]: GUI update [19:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:20] 10Operations, 10Ops-Access-Requests, 10Gerrit, 10Release-Engineering-Team: Requesting access to deployment for pmiazga - https://phabricator.wikimedia.org/T192159#4137543 (10Dzahn) 05Open>03Resolved a:03Dzahn Thanks @demon! @pmiazga Now you should have the +2 and things should just work. I'm callin... [20:02:49] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Add Tim_WMDE to the ldap/wmde group - https://phabricator.wikimedia.org/T192256#4137546 (10Dzahn) p:05Triage>03Normal [20:03:26] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 48.18, 33.13, 28.80 [20:03:57] (03CR) 10Chad: [C: 032] group0 to wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427183 (owner: 10Chad) [20:04:06] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old. [20:05:10] (03Merged) 10jenkins-bot: group0 to wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427183 (owner: 10Chad) [20:08:37] gehel: re: maps access request: do you think it makes more sense to add some new sudo privilege lines to the "maps-admin" group ?
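On the nfs-exportd change above ("add some exception handling to do_fix_and_export"): the general idea is to wrap the per-project export work so a single failure no longer aborts the whole run. A hypothetical sketch of that pattern — names and structure are invented for illustration and are not the actual patch:

```python
import logging

log = logging.getLogger("nfs-exportd-sketch")


def export_project(project: str) -> None:
    """Placeholder for the real per-project fix/export work (hypothetical)."""
    log.info("exporting %s", project)


def do_fix_and_export(projects: list) -> None:
    # Catch and log per-project failures instead of letting one bad
    # project raise out of the loop and stop every other export.
    for project in projects:
        try:
            export_project(project)
        except Exception:
            log.exception("failed to export %s, continuing", project)
```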
[20:09:10] be back soon [20:12:27] (03CR) 10jenkins-bot: group0 to wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427183 (owner: 10Chad) [20:13:14] !log demon@tin rebuilt and synchronized wikiversions files: group0 to wmf.30 [20:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:35] 10Operations, 10Ops-Access-Requests, 10Gerrit, 10Release-Engineering-Team: Requesting access to deployment for pmiazga - https://phabricator.wikimedia.org/T192159#4137561 (10Niharika) @pmiazga Feel free to ping me on IRC and/or setup calendar time to do a walkthrough during a morning SWAT. I'm based in SF. [20:19:57] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Add Tim_WMDE to the ldap/wmde group - https://phabricator.wikimedia.org/T192256#4137586 (10RStallman-legalteam) @Tim_WMDE - If you provide your full name and email address here or contact me directly at rstallman@wikimedia.org, I'll create an NDA for... [20:21:27] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 36.52, 33.34, 32.22 [20:29:36] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 15.87, 20.84, 23.73 [20:40:05] (03PS3) 10Ayounsi: Puppet: add ping_offload role and profile [puppet] - 10https://gerrit.wikimedia.org/r/424151 (https://phabricator.wikimedia.org/T190090) [20:43:10] (03CR) 10Ayounsi: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10957/ping1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/424151 (https://phabricator.wikimedia.org/T190090) (owner: 10Ayounsi) [20:43:19] (03PS4) 10Ayounsi: Puppet: add ping_offload role and profile [puppet] - 10https://gerrit.wikimedia.org/r/424151 (https://phabricator.wikimedia.org/T190090) [20:44:16] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [20:51:50] (03PS1) 10Andrew Bogott: labtestweb: remove puppet classes for syncing with wikitech-static [puppet] - 10https://gerrit.wikimedia.org/r/427256 [20:51:52] (03PS1) 10Andrew Bogott: labtestweb: Allow access to db from maintenance hosts [puppet] - 10https://gerrit.wikimedia.org/r/427257 (https://phabricator.wikimedia.org/T192339) [20:52:23] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4137680 (10ayounsi) [20:53:24] (03CR) 10Andrew Bogott: [C: 032] labtestweb: remove puppet classes for syncing with wikitech-static [puppet] - 10https://gerrit.wikimedia.org/r/427256 (owner: 10Andrew Bogott) [20:53:30] (03CR) 10Andrew Bogott: [C: 032] labtestweb: Allow access to db from maintenance hosts [puppet] - 10https://gerrit.wikimedia.org/r/427257 (https://phabricator.wikimedia.org/T192339) (owner: 10Andrew Bogott) [20:55:21] (03PS1) 10Andrew Bogott: Revert "labtestweb: Allow access to db from maintenance hosts" [puppet] - 10https://gerrit.wikimedia.org/r/427258 [20:56:12] (03CR) 10Andrew Bogott: [C: 032] Revert "labtestweb: Allow access to db from maintenance hosts" [puppet] - 10https://gerrit.wikimedia.org/r/427258 (owner: 10Andrew Bogott) [20:57:46] (03PS5) 10Ppchelko: Enable EventBus for job events for all but wikitech wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425601 (https://phabricator.wikimedia.org/T191464) [20:58:27] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:05:37] RECOVERY - Keyholder SSH agent on labpuppetmaster1002 is OK: OK: Keyholder is armed with all configured keys. [21:08:27] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:12:18] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 5 others: Proxies information gone from Zero portal - https://phabricator.wikimedia.org/T187014#4137700 (10Nuria) {F17046846} Indeed things look like they are coming back, Nigeria pageviews are present again and US traffic is qui... [21:15:13] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 5 others: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country - https://phabricator.wikimedia.org/T187014#4137702 (10Nuria) [21:15:48] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 5 others: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country - https://phabricator.wikimedia.org/T187014#3961955 (10Nuria) Solving ticket. Added note to: https://wikitech.wikimedia.org/wik... [21:15:56] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 5 others: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country - https://phabricator.wikimedia.org/T187014#4137704 (10Nuria) Solving ticket. Added note to: https://wikitech.wikimedia.org/wik... [21:16:42] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 5 others: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country - https://phabricator.wikimedia.org/T187014#3961955 (10Nuria) 05Open>03Resolved [21:18:57] RECOVERY - High CPU load on API appserver on mw1279 is OK: OK - load average: 27.40, 29.18, 29.99 [21:23:57] (03PS1) 10Legoktm: Update ExtensionDistributor for REL1_31 branching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427262 [21:26:02] 10Operations, 10Analytics-Kanban, 10Patch-For-Review: Review changes to /etc/java-8-openjdk/security/java.security in Kafka from u162 update - https://phabricator.wikimedia.org/T190400#4137727 (10Nuria) 05Open>03Resolved [21:29:34] (03CR) 10Chad: [C: 032] Update ExtensionDistributor for REL1_31 branching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427262 (owner: 10Legoktm) [21:30:58] (03Merged) 10jenkins-bot: Update ExtensionDistributor for REL1_31 branching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427262 (owner: 10Legoktm) [21:31:12] (03CR) 10jenkins-bot: Update ExtensionDistributor for REL1_31 branching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427262 (owner: 10Legoktm) [21:34:24] !log demon@tin Synchronized wmf-config/CommonSettings.php: ext-dist config changes for rel1_31 (duration: 01m 16s) [21:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:20] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4137759 (10jcrespo) 05Resolved>03Open [22:05:13] (03PS1) 10Dzahn: admin: let maps-admins run any command as postgres,osmupdater,cassandra [puppet] - 10https://gerrit.wikimedia.org/r/427271 (https://phabricator.wikimedia.org/T192115) [22:12:14] (03CR) 10MaxSem: [C: 031] admin: let maps-admins run any command as postgres,osmupdater,cassandra [puppet] - 10https://gerrit.wikimedia.org/r/427271 
(https://phabricator.wikimedia.org/T192115) (owner: 10Dzahn) [22:13:23] (03PS1) 10Ejegg: CentralNotice: emit CSP headers on banner previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427273 (https://phabricator.wikimedia.org/T190100) [22:14:07] (03PS2) 10Ejegg: CentralNotice: emit CSP headers on banner previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427273 (https://phabricator.wikimedia.org/T190100) [22:14:10] (03CR) 10jerkins-bot: [V: 04-1] CentralNotice: emit CSP headers on banner previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427273 (https://phabricator.wikimedia.org/T190100) (owner: 10Ejegg) [22:14:51] (03CR) 10jerkins-bot: [V: 04-1] CentralNotice: emit CSP headers on banner previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427273 (https://phabricator.wikimedia.org/T190100) (owner: 10Ejegg) [22:15:13] (03PS3) 10Ejegg: CentralNotice: emit CSP headers on banner previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427273 (https://phabricator.wikimedia.org/T190100) [22:15:47] (03CR) 10Ejegg: [C: 04-1] "-1ing to avoid premature deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427273 (https://phabricator.wikimedia.org/T190100) (owner: 10Ejegg) [22:20:52] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 5 others: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country - https://phabricator.wikimedia.org/T187014#3961964 (10atgo) Thank you! [22:24:33] (03PS4) 10Ejegg: Beta CentralNotice: emit CSP headers on banner previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427273 (https://phabricator.wikimedia.org/T190100) [22:26:08] (03PS1) 10Ejegg: CentralNotice: emit CSP headers on banner previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427275 (https://phabricator.wikimedia.org/T190100) [22:26:38] (03CR) 10Ejegg: [C: 04-1] "-1 ing to prevent premature deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427275 (https://phabricator.wikimedia.org/T190100) (owner: 10Ejegg) [22:26:56] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 18.39, 21.46, 23.63 [22:30:56] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 5 others: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country - https://phabricator.wikimedia.org/T187014#4137843 (10DFoy) Thanks everyone, great to see this working again! [22:31:36] 10Operations, 10Ops-Access-Requests, 10Maps-Sprint, 10Patch-For-Review: sudoer access for pnorman on maps servers - https://phabricator.wikimedia.org/T192115#4137844 (10Dzahn) @Gehel Ok, so after looking at it again, i have a new suggestion. Adding permissions to the existing maps-admin group. There are j... [22:33:26] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 20 probes of 320 (alerts on 19) - https://atlas.ripe.net/measurements/11645085/#!map [22:34:15] 10Operations, 10Ops-Access-Requests, 10Maps-Sprint, 10Patch-For-Review: sudoer access for pnorman on maps servers - https://phabricator.wikimedia.org/T192115#4137846 (10Dzahn) p:05Triage>03Normal [22:34:51] 10Operations, 10hardware-requests: Reclaim/Decommission Silver.wikimedia.org - https://phabricator.wikimedia.org/T190085#4137847 (10Dzahn) p:05Triage>03Normal [22:35:31] (03CR) 10AndyRussG: [C: 032] "Woohooo!!!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/427273 (https://phabricator.wikimedia.org/T190100) (owner: 10Ejegg) [22:36:45] (03Merged) 10jenkins-bot: Beta CentralNotice: emit CSP headers on banner previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427273 (https://phabricator.wikimedia.org/T190100) (owner: 10Ejegg) [22:38:26] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 0 probes of 320 (alerts on 19) - https://atlas.ripe.net/measurements/11645085/#!map [22:41:24] (03CR) 10jenkins-bot: Beta CentralNotice: emit CSP headers on banner previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427273 (https://phabricator.wikimedia.org/T190100) (owner: 10Ejegg) [22:54:35] (03CR) 10Aaron Schulz: [C: 032] Point mc-labs.php to mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427189 (owner: 10Aaron Schulz) [22:55:55] (03Merged) 10jenkins-bot: Point mc-labs.php to mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427189 (owner: 10Aaron Schulz) [22:55:57] (03PS1) 10EBernhardson: Shift search traffic back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427278 (https://phabricator.wikimedia.org/T191236) [22:58:56] (03CR) 10jenkins-bot: Point mc-labs.php to mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427189 (owner: 10Aaron Schulz) [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180417T2300). [23:00:04] Gilles and ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:16] \o [23:00:23] o/ [23:00:42] mind if I go first? I'd like to go to bed soon :} [23:00:59] gilles: sure, i'm actually just running a warmup routine now on the cluster anyways. It will take a few [23:01:07] (03CR) 10Gilles: [C: 032] Fix $wgLocalFileRepo definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424618 (https://phabricator.wikimedia.org/T191643) (owner: 10Gilles) [23:02:36] (03Merged) 10jenkins-bot: Fix $wgLocalFileRepo definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424618 (https://phabricator.wikimedia.org/T191643) (owner: 10Gilles) [23:04:21] (03CR) 10jenkins-bot: Fix $wgLocalFileRepo definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424618 (https://phabricator.wikimedia.org/T191643) (owner: 10Gilles) [23:05:08] AndyRussG: are you deploying e3c20c30d7f4605a ? [23:05:59] AaronSchulz: hi! It's just a change to beta cluster config. Just testing it now... 
There's a related production change that hasn't been +2'd yet [23:06:25] (03PS2) 10Dzahn: icinga: extend screen/tmux warning time from 4 days to 10 days [puppet] - 10https://gerrit.wikimedia.org/r/427195 (owner: 10Herron) [23:06:31] Planning to sync both during a SWAT deploy tomorrow :) [23:06:48] the one you mention shouldn't affect anything on prod were it to be synced early though [23:07:07] (Hope we're doing this right :) ) [23:07:33] (03CR) 10Luke081515: [C: 031] icinga: extend screen/tmux warning time from 4 days to 10 days [puppet] - 10https://gerrit.wikimedia.org/r/427195 (owner: 10Herron) [23:07:43] !log gilles@tin Synchronized wmf-config/filebackend.php: Fix private wiki DC configuration: [[gerrit:424618|Serve private wiki thumbnails with Thumbor (T191643)]] (duration: 01m 18s) [23:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:49] T191643: Thumbor private wiki thumbnail requests are incorrectly served by codfw - https://phabricator.wikimedia.org/T191643 [23:08:07] ebernhardson: I'm all done! [23:08:10] AndyRussG: once you merge you pretty much have to deploy [23:08:34] AaronSchulz: ah hmmm [23:08:37] !log Private wiki thumbnail traffic now going to eqiad T191643 [23:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:59] AaronSchulz: I guess I'll talk to whoever's doing the SWAT deploy right now then... does that make sense? [23:11:23] AndyRussG: since the repo on tin is just master + security patches, it's odd to deploy something and notice git log HEAD..origin/master has some random-seeming changes. Also, afaik, some alerts get triggered if there are undeployed changes for too long. [23:11:40] greg-g: we still have those checks right? [23:11:41] (03CR) 10EBernhardson: [C: 032] Shift search traffic back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427278 (https://phabricator.wikimedia.org/T191236) (owner: 10EBernhardson) [23:12:48] AaronSchulz: ah interesting... [23:13:02] AndyRussG: I guess ebernhardson can just deploy it all at once if he's OK with it [23:13:06] (03Merged) 10jenkins-bot: Shift search traffic back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427278 (https://phabricator.wikimedia.org/T191236) (owner: 10EBernhardson) [23:13:20] (03CR) 10jenkins-bot: Shift search traffic back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427278 (https://phabricator.wikimedia.org/T191236) (owner: 10EBernhardson) [23:13:36] AaronSchulz: is he doing the SWAT now? [23:14:28] or is something else going on?
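The point made at 23:11:23 — the deploy host tracks master plus security patches, so anything merged in Gerrit but not yet pulled shows up in `git log HEAD..origin/master`, and long-standing undeployed changes eventually trip an alert — boils down to comparing the local checkout against the remote branch. A rough sketch of that comparison; the repository path and branch defaults here are assumptions for illustration, not the actual WMF check:

```python
import subprocess


def undeployed_changes(repo: str = "/srv/mediawiki-staging",
                       branch: str = "origin/master") -> list:
    """List commits present on the remote branch but missing from the local
    checkout, i.e. merged but not yet pulled/deployed.
    Default path and branch are illustrative assumptions."""
    subprocess.run(["git", "-C", repo, "fetch", "--quiet"], check=True)
    out = subprocess.run(
        ["git", "-C", repo, "log", "--oneline", f"HEAD..{branch}"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]


if __name__ == "__main__":
    pending = undeployed_changes()
    if pending:
        print(f"WARNING - {len(pending)} undeployed change(s):")
        print("\n".join(pending))
    else:
        print("OK - local checkout matches the remote branch")
```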
[23:16:56] (03CR) 10Dzahn: [C: 032] icinga: extend screen/tmux warning time from 4 days to 10 days [puppet] - 10https://gerrit.wikimedia.org/r/427195 (owner: 10Herron) [23:17:06] I also have deploy rights, though I've never deployed a config change [23:17:30] AaronSchulz: it should be still there yes (the warning on undeployed changes) [23:17:35] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: T191236: Shift search traffic back to eqiad (duration: 01m 17s) [23:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:41] T191236: Resolve elasticsearch latency alerts - https://phabricator.wikimedia.org/T191236 [23:17:55] (03PS3) 10Dzahn: icinga: extend screen/tmux warning time from 4 days to 10 days [puppet] - 10https://gerrit.wikimedia.org/r/427195 (owner: 10Herron) [23:18:56] (03CR) 10jerkins-bot: [V: 04-1] icinga: extend screen/tmux warning time from 4 days to 10 days [puppet] - 10https://gerrit.wikimedia.org/r/427195 (owner: 10Herron) [23:19:30] (03PS4) 10Dzahn: icinga: extend screen/tmux warning time from 4 days to 10 days [puppet] - 10https://gerrit.wikimedia.org/r/427195 (https://phabricator.wikimedia.org/T165348) (owner: 10Herron) [23:21:11] greg-g: and that's both remove vs local git *and* mediawiki-staging vs /srv/mediawiki ? [23:21:20] *remote vs local [23:28:07] ebernhardson or gilles if you're pushing out config changes just now, feel like syncing this one too, maybe? https://gerrit.wikimedia.org/r/#/c/427273/ [23:29:21] AndyRussG: sure, sec [23:29:27] PROBLEM - nova-compute proc minimum on labvirt1015 is CRITICAL: Return code of 255 is out of bounds [23:29:30] AaronSchulz: thanks much for the heads-up! [23:29:32] ebernhardson: thanks! [23:30:28] RECOVERY - nova-compute proc minimum on labvirt1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [23:30:29] I also added it to the Deployments page for this slot, just for the record [23:30:47] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 348 MB (3% inode=75%) [23:30:51] AndyRussG: looks to have already been merged on tin, syncing out the -labs file but should be a noop [23:31:48] !log ebernhardson@tin Synchronized wmf-config/CommonSettings-labs.php: labs config noop (duration: 01m 15s) [23:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:54] ebernhardson: yeah we just merged this in gerrit to test a bit on the beta cluster, deploying the prod change tomorrow [23:32:14] (along with the related feature, which is not yet out on prod) [23:36:50] with that, SWAT is complete! [23:38:49] AaronSchulz: not 100% sure [23:41:53] ebernhardson: thanks! [23:43:44] (03CR) 10Krinkle: [C: 031] navtiming.py: now deployed via scap [puppet] - 10https://gerrit.wikimedia.org/r/427192 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [23:48:03] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 5 others: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country - https://phabricator.wikimedia.org/T187014#4138028 (10Tbayer) Great, thanks everyone! But do we now know what caused the corre... [23:48:48] (03CR) 10Dzahn: [C: 032] DNS: Remove mgmt DNS for db2012 [dns] - 10https://gerrit.wikimedia.org/r/427179 (https://phabricator.wikimedia.org/T187543) (owner: 10Papaul) [23:49:49] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 5 others: Proxies information gone from Zero portal.
Opera mini pageviews geolocating to wrong country - https://phabricator.wikimedia.org/T187014#4138030 (10Nuria) The proxy list for zero was emptied and that must have included a... [23:52:00] (03PS2) 10Dzahn: navtiming.py: now deployed via scap [puppet] - 10https://gerrit.wikimedia.org/r/427192 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [23:52:14] (03PS1) 10Aaron Schulz: Enable mcrouter routing key prefixes for deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427281 [23:52:29] (03CR) 10Dzahn: [C: 032] navtiming.py: now deployed via scap [puppet] - 10https://gerrit.wikimedia.org/r/427192 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [23:53:54] (03PS5) 10Dzahn: add mgmt DNS for nihonium, new eqiad maintenance server [dns] - 10https://gerrit.wikimedia.org/r/426295 (https://phabricator.wikimedia.org/T192092) [23:54:43] 10Operations, 10Patch-For-Review: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092#4138046 (10Dzahn) 05Open>03stalled p:05Triage>03High [23:54:56] 10Operations, 10hardware-requests: request to assign WMF3565 as terbium equivalent - https://phabricator.wikimedia.org/T192185#4138048 (10Dzahn) p:05Triage>03High [23:56:13] mutante: thx [23:56:26] mark: navtiming scapify is rolling out [23:56:31] sorry mar k [23:56:33] marlier: ^ [23:56:58] (03CR) 10Aaron Schulz: [C: 032] Enable mcrouter routing key prefixes for deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427281 (owner: 10Aaron Schulz) [23:57:58] welcome, Krinkle [23:58:09] (03Merged) 10jenkins-bot: Enable mcrouter routing key prefixes for deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427281 (owner: 10Aaron Schulz) [23:58:51] (03CR) 10jenkins-bot: Enable mcrouter routing key prefixes for deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427281 (owner: 10Aaron Schulz) [23:59:00] Krinkle: typing is hard :-) [23:59:20] mutante: thanks, I'll verify, should be safe so NBD