[00:02:24] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [00:07:55] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [00:24:35] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [00:25:44] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [00:26:48] (PS2) Krinkle: Change logo for eswiki temporarily [mediawiki-config] - https://gerrit.wikimedia.org/r/443742 (https://phabricator.wikimedia.org/T198761) (owner: Urbanecm) [00:27:04] (CR) Krinkle: [C: 2] "ImageOptim was able to compress it slightly more via AdvPNG." [mediawiki-config] - https://gerrit.wikimedia.org/r/443742 (https://phabricator.wikimedia.org/T198761) (owner: Urbanecm) [00:27:21] * Krinkle stating on deploy1001 and mwdebug1002 [00:27:25] staging* [00:27:47] Urbanecm: stand by for verification :) [00:28:24] (Merged) jenkins-bot: Change logo for eswiki temporarily [mediawiki-config] - https://gerrit.wikimedia.org/r/443742 (https://phabricator.wikimedia.org/T198761) (owner: Urbanecm) [00:28:37] (CR) jenkins-bot: Change logo for eswiki temporarily [mediawiki-config] - https://gerrit.wikimedia.org/r/443742 (https://phabricator.wikimedia.org/T198761) (owner: Urbanecm) [00:30:16] Urbanecm: live on mwdebug1002 now [00:33:39] abian: Urbanecm: Please verify and confirm that this should deploy now. 
Is there another change you're trying to sync it with on-wiki? [00:34:08] Ah, it seems it was already changed on-wiki using common.css. Without @2, though, so it's serving low-res images at the moment. [00:34:42] Yes, that was a temporary solution [00:34:49] And makes it hard to verify for you , but if you're comfortable with the DevTools in your browser, you can simulate it by turning those rules off. Let me know, I can also try to verify it for you instead. [00:36:22] !log krinkle@deploy1001 Synchronized static/images/project-logos/: T198761 - Update eswiki logo (duration: 00m 51s) [00:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:26] T198761: Replace logo of the Spanish Wikipedia, which joins the protests against the European copyright directive proposal - https://phabricator.wikimedia.org/T198761 [00:36:29] I guess these changes will take some hours to propagate (I don't know the infrastructure, but that's what happened when I updated the Wikidata logo) [00:36:44] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [00:37:14] abian: https://i.imgur.com/fxg0149.png [00:37:48] that's how I confirmed that it worked via the mwdebug1002 server [00:37:51] It's now deployed. [00:37:56] I'll also purge the cache. 
[00:37:58] !log Purge https://en.wikipedia.org/static/images/project-logos/eswiki.png [00:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:02] !log Purge https://en.wikipedia.org/static/images/project-logos/eswiki-1.5x.png [00:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:05] !log Purge https://en.wikipedia.org/static/images/project-logos/eswiki-2x.png [00:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:27] Yeah, apparently fixed now :) [00:38:38] abian: For people that have browsed es.wikipedia recently, their browser will still remember the old file for a long time indeed. But for any new visitors it will use the new copy. It is updated in our caches, but we do not control the browser cache :) [00:38:54] Sure :) [00:39:00] Thanks, Krinkle! [00:39:04] yw :) [00:39:24] Remember to remove the common.css override to avoid browsers downloading the logo twice :) [00:40:15] {{done}} [00:42:14] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [00:46:07] Operations, Release-Engineering-Team, Scap, Performance-Team (Radar): mwscript emits warning "grep: GREP_OPTIONS is deprecated; please use an alias or script" - https://phabricator.wikimedia.org/T198775 (Krinkle) [00:47:45] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [00:54:24] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [01:19:49] is it just me, or does anyone else still see https://meta.wikimedia.org/wiki/Special:UsersWhoWillBeRenamed? 
[01:23:54] PROBLEM - Memory correctable errors -EDAC- on scb1002 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=scb1002&var-datasource=eqiad%2520prometheus%252Fops [01:24:46] I see the header for an empty table. [01:26:04] huh [01:26:11] that shouldn't be there [01:27:09] https://phabricator.wikimedia.org/T118637 should have removed it, as $wgCentralAuthEnableUsersWhoWillBeRenamed should default to false [01:28:24] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [01:29:25] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [01:32:01] it doesn't look to be set via mw-config either [01:39:24] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [01:40:25] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [01:42:10] o wait, nvm, looks like it didn't get deployed [01:53:53] (PS1) Krinkle: webperf: Rename role::xenon to profile::webperf::xenon [puppet] - https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312) [01:54:47] (CR) jerkins-bot: [V: -1] webperf: Rename role::xenon to profile::webperf::xenon [puppet] - https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312) (owner: Krinkle) [01:55:49] (PS2) Krinkle: webperf: Rename role::xenon to profile::webperf::xenon [puppet] - https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312) [01:58:04] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [02:04:34] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [02:35:36] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.10) (duration: 14m 57s) [02:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:44] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [02:43:15] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [02:45:51] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Wed Jul 4 02:45:51 UTC 2018 (duration 10m 16s) [02:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:24] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [02:59:54] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:05:15] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:06:24] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:17:25] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:18:34] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:24:04] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:29:35] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:40:34] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:58:24] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:59:57] (PS3) Krinkle: webperf: Rename role::xenon to profile::webperf::xenon [puppet] - https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312) [04:09:25] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [04:10:26] (CR) Krinkle: "No-op in puppet compiler for prod, and applies cleanly to beta cluster." [puppet] - https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312) (owner: Krinkle) [04:10:34] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [04:10:44] (CR) Krinkle: "https://puppet-compiler.wmflabs.org/compiler02/11659/" [puppet] - https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312) (owner: Krinkle) [04:17:50] (PS1) Krinkle: profiler: Enable xenon collection in labs (same as prod) [mediawiki-config] - https://gerrit.wikimedia.org/r/443760 (https://phabricator.wikimedia.org/T195312) [04:19:16] (CR) Krinkle: [C: 2] profiler: Enable xenon collection in labs (same as prod) [mediawiki-config] - https://gerrit.wikimedia.org/r/443760 (https://phabricator.wikimedia.org/T195312) (owner: Krinkle) [04:20:36] (Merged) jenkins-bot: profiler: Enable xenon collection in labs (same as prod) [mediawiki-config] - https://gerrit.wikimedia.org/r/443760 (https://phabricator.wikimedia.org/T195312) (owner: Krinkle) [04:20:41] (CR) jenkins-bot: profiler: Enable xenon collection in labs (same as prod) [mediawiki-config] - https://gerrit.wikimedia.org/r/443760 (https://phabricator.wikimedia.org/T195312) (owner: Krinkle) [04:20:55] (PS2) Krinkle: Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443529 [04:21:02] (PS2) Krinkle: Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443530 [04:27:46] (CR) Tim Starling: [C: 2] Rewrite sql script to use the new mysql.php wrapper [puppet] - https://gerrit.wikimedia.org/r/441153 (owner: Tim Starling) [04:28:11] (PS4) Tim Starling: Rewrite sql script to use the new mysql.php wrapper [puppet] - https://gerrit.wikimedia.org/r/441153 [04:29:54] PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 430.55 seconds [04:32:35] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet 
operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [04:33:14] RECOVERY - MariaDB Slave Lag: s6 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [04:33:53] (PS1) Krinkle: mediawiki: Change xenon interval for Beta Cluster from 10min to 30s [puppet] - https://gerrit.wikimedia.org/r/443762 [04:35:16] (PS2) Krinkle: mediawiki: Change xenon interval for Beta Cluster from 10min to 30s [puppet] - https://gerrit.wikimedia.org/r/443762 [04:38:05] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [04:43:35] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [04:46:34] (PS3) Krinkle: mediawiki: Change xenon interval for Beta Cluster from 10min to 30s [puppet] - https://gerrit.wikimedia.org/r/443762 (https://phabricator.wikimedia.org/T195312) [04:49:14] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [04:50:31] Operations, ops-eqiad, DBA, Patch-For-Review: Physically move es1017 from D to C row - https://phabricator.wikimedia.org/T197072 (Marostegui) Anything left after repooling the host? 
[04:53:42] (PS1) Marostegui: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/443763 (https://phabricator.wikimedia.org/T191316) [04:54:18] Operations, ops-eqiad, DBA, Patch-For-Review: Physically move es1017 from D to C row - https://phabricator.wikimedia.org/T197072 (jcrespo) Open>Resolved a:Cmjohnson I don't think so. [04:55:51] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/443763 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui) [04:57:05] (Merged) jenkins-bot: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/443763 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui) [04:57:26] (PS5) Krinkle: webperf: Rename webperf profiles for clarity [puppet] - https://gerrit.wikimedia.org/r/443752 (https://phabricator.wikimedia.org/T195314) [04:57:28] (PS4) Krinkle: webperf: Rename role::xenon to profile::webperf::xenon [puppet] - https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312) [04:57:30] (PS4) Krinkle: mediawiki: Change xenon interval for Beta Cluster from 10min to 30s [puppet] - https://gerrit.wikimedia.org/r/443762 [04:57:32] (PS1) Krinkle: webperf: Enable xenondata_host on perfsite in Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/443764 [04:57:42] (PS2) Krinkle: webperf: Enable xenondata_host on perfsite in Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/443764 (https://phabricator.wikimedia.org/T195312) [04:57:44] (CR) jenkins-bot: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/443763 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui) [04:59:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1101:3317 for alter table (duration: 00m 52s) [04:59:01] !log Deploy schema change on db1101:3317 T191316 T192926 
T89737 T195193 [04:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:10] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [04:59:11] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [04:59:11] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [04:59:11] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [05:03:04] (PS1) Jcrespo: mariadb: Depool db2038 and db2047 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/443766 [05:04:08] (CR) Marostegui: [C: 1] mariadb: Depool db2038 and db2047 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/443766 (owner: Jcrespo) [05:04:26] (PS1) Krinkle: deploymen-prep: Remove mediawiki06 from dsh groups [puppet] - https://gerrit.wikimedia.org/r/443767 (https://phabricator.wikimedia.org/T192996) [05:04:38] (PS2) Krinkle: deployment-prep: Remove mediawiki06 from dsh groups [puppet] - https://gerrit.wikimedia.org/r/443767 (https://phabricator.wikimedia.org/T192996) [05:28:07] !log Optimize recentchanges table on s3 codfw - this will generate lag on codfw s3 - T178290 [05:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:11] T178290: Check recentchanges table and query errors on wikis other than commonswiki and ruwiki - https://phabricator.wikimedia.org/T178290 [05:34:08] !log Optimize recentchanges table on s3 eqiad, host by host T178290 [05:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:12] T178290: Check recentchanges table and query errors on wikis other than commonswiki and ruwiki - https://phabricator.wikimedia.org/T178290 [05:34:59] PROBLEM - 
kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:36:08] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:41:09] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:52:59] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:58:28] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:59:38] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:00:58] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. 
[06:03:36] !log Optimize recentchanges table on s7 codfw - this will generate lag on codfw s7 - T178290 [06:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:40] T178290: Check recentchanges table and query errors on wikis other than commonswiki and ruwiki - https://phabricator.wikimedia.org/T178290 [06:06:04] (PS2) Giuseppe Lavagetto: service::node: Expose the MW appservers' host to modules [puppet] - https://gerrit.wikimedia.org/r/443444 (https://phabricator.wikimedia.org/T198461) (owner: Mobrovac) [06:06:14] (CR) Giuseppe Lavagetto: [V: 2 C: 2] service::node: Expose the MW appservers' host to modules [puppet] - https://gerrit.wikimedia.org/r/443444 (https://phabricator.wikimedia.org/T198461) (owner: Mobrovac) [06:07:28] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [06:14:26] (CR) Elukey: [C: 1] wmf-auto-reimage: use warning log level [puppet] - https://gerrit.wikimedia.org/r/443671 (owner: Volans) [06:15:37] !log reimage aqs1008 to Debian Stretch [06:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:06] (CR) Tim Starling: Fix phabricator rate limiting (1 comment) [puppet] - https://gerrit.wikimedia.org/r/441525 (https://phabricator.wikimedia.org/T197922) (owner: 20after4) [06:21:39] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:23:09] PROBLEM - tileratorui on maps-test2004 is CRITICAL: connect to address 10.192.16.35 and port 6535: Connection refused [06:23:39] PROBLEM - tilerator on maps-test2004 is CRITICAL: connect to address 10.192.16.35 and 
port 6534: Connection refused [06:27:09] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:29:04] PROBLEM - puppet last run on phab1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh] [06:29:07] Anyone working with phabricator? [06:29:14] puppet error there, very recent [06:30:23] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/Lets_Encrypt_Authority_X3.crt] [06:32:03] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] [06:32:29] jynus: do you mean on phab1002? [06:32:40] yes [06:32:57] I think it is still wip, Daniel was working on it before leaving for holidays [06:33:06] but the error is recent [06:33:14] either something else changed [06:33:23] or something I cannot think of [06:33:24] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:33:53] yeah you are right, weird [06:34:39] jynus: was temporary, just re-run puppet and it went fine [06:35:08] then maybe puppetmaster temporary issues [06:35:53] RECOVERY - tileratorui on maps-test2004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.091 second response time [06:37:31] RECOVERY - puppet last run on phab1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:43:57] <_joe_> !log restarting tilerator on maps-test2004, to check 
if it can recover [06:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:41] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:53:44] <_joe_> I think that's the puppetmaster logrotating [06:53:51] <_joe_> the puppet failures above [06:54:01] <_joe_> I've seen them repeatedly around 6:30 [06:55:20] PROBLEM - Host kubernetes2003 is DOWN: PING CRITICAL - Packet loss = 100% [06:55:50] RECOVERY - Host kubernetes2003 is UP: PING WARNING - Packet loss = 86%, RTA = 471.89 ms [06:56:01] RECOVERY - tilerator on maps-test2004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.075 second response time [06:57:10] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:42] PROBLEM - puppet last run on kubernetes2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[docker-registry.discovery.wmnet/calico/node] [07:00:33] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:07:42] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:07:45] (PS1) Giuseppe Lavagetto: redis: remove cronjob for restarts on slaves as well [puppet] - https://gerrit.wikimedia.org/r/443772 (https://phabricator.wikimedia.org/T191316) [07:07:47] (PS1) Giuseppe Lavagetto: redis: remove now-useless specific classes [puppet] - https://gerrit.wikimedia.org/r/443773 (https://phabricator.wikimedia.org/T191316) [07:09:32] \o/ [07:09:43] RECOVERY - puppet last run on kubernetes2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:11:12] <_joe_> elukey: I'm not done :P [07:14:02] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:18:57] !log reimage kubernetes100{1,2}.eqiad.wmnet kubernetes200{1,2}.codfw.wmnet without swap [07:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:38] !log installing imagemagick security updates [07:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:55] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type={container_status,create_container,list_containers,list_podsandbox,podsandbox_status,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:30:55] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:33:47] that's ^ expected [07:33:49] (CR) Filippo Giunchedi: [C: 1] "LGTM, let's coordinate on when to merge this" [puppet] - https://gerrit.wikimedia.org/r/443114 (https://phabricator.wikimedia.org/T191659) (owner: Eevans) [07:36:02] (CR) Hashar: [C: 2] ":]" [mediawiki-config] - https://gerrit.wikimedia.org/r/443475 (https://phabricator.wikimedia.org/T195675) (owner: C. Scott Ananian) [07:36:11] ^^ labs / beta only [07:36:36] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:37:23] (Merged) jenkins-bot: Give a name to en-rtl wiki in Special:SiteMatrix in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/443475 (https://phabricator.wikimedia.org/T195675) (owner: C. Scott Ananian) [07:37:36] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:38:06] !log rebased /srv/mediawiki-staging on deploy1001 for beta cluster only change https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/443475/ [07:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:11] (CR) jenkins-bot: Give a name to en-rtl wiki in Special:SiteMatrix in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/443475 (https://phabricator.wikimedia.org/T195675) (owner: C. Scott Ananian) [07:42:14] (PS4) Ema: cache::text: ship cache_misc VCL [puppet] - https://gerrit.wikimedia.org/r/440157 (https://phabricator.wikimedia.org/T164609) [07:42:57] (CR) Ema: [C: 2] cache::text: ship cache_misc VCL [puppet] - https://gerrit.wikimedia.org/r/440157 (https://phabricator.wikimedia.org/T164609) (owner: Ema) [07:43:05] !log resuming rolling restart of cassandra on restbase hosts in eqiad to pick up OpenJDK security update [07:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:05] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/443780 [07:46:22] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:48:00] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/443780 (owner: Marostegui) [07:49:13] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/443780 (owner: Marostegui) [07:49:22] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:49:29] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/443780 (owner: Marostegui) [07:50:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1101:3317 after alter table (duration: 00m 53s) [07:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:39] (CR) Alexandros Kosiaris: [C: 1] Use backports version of osm2pgsql on Stretch for improved memory handling [puppet] - https://gerrit.wikimedia.org/r/443668 (https://phabricator.wikimedia.org/T198485) (owner: Mholloway) [07:53:39] (PS1) Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/443782 (https://phabricator.wikimedia.org/T191316) [07:53:58] !log reimaging silver (spare host, to-be-decomm'ed) as testing host for the reimage script [07:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:47] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/443782 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui) [07:57:47] Operations, Analytics, Analytics-Kanban, netops: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (Gehel) For WDQS, we should keep access to at least 2 nodes in both eqiad and codfw. I propose: wdqs1003: 10.64.0.14 wdqs1004: 10.64.0.17 wdqs2001: 10.192.32.1... 
[07:58:02] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443782 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [07:58:31] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443782 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [07:59:26] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1078 for alter table (duration: 00m 50s) [07:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:16] !log Deploy schema change on db1078 T191316 T192926 T89737 T195193 [08:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:21] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [08:00:22] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [08:00:22] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [08:00:23] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [08:05:34] PROBLEM - LVS HTTP IPv4 on mathoid.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.20 and port 10042: Connection refused [08:05:42] uh oh [08:05:51] I might have been too eager here, looking [08:06:13] mathoid threw its toys out of the pram? 
[08:06:19] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - mathoid_10042: Servers kubernetes2002.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled [08:06:26] ah ok explainable, my mistake [08:06:40] I should have kept the pace at 1 host at a time [08:06:50] ah ha [08:06:52] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - mathoid_10042: Servers kubernetes2004.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled [08:06:56] should fix itself really quick [08:07:03] ok... [08:08:20] kubernetes is already creating the containers on kubernetes2003, kubernetes2004 [08:09:47] RECOVERY - LVS HTTP IPv4 on mathoid.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 925 bytes in 0.076 second response time [08:09:55] !log rebooting multatuli for kernel update to 4.9.107~wmf1 [08:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:02] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kubernetes2004.codfw.wmnet, kubernetes2003.codfw.wmnet]) [08:10:02] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kubernetes2002.codfw.wmnet, kubernetes2003.codfw.wmnet]) [08:10:03] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10elukey) First batch of changes: ``` delete firewall family inet filter analytics-in4 term logstash delete firewall family inet filter analytics-in4 term event... [08:10:07] ok fixed. [08:10:22] I got only the recovery page so far [08:10:27] 4 mins of outage. But it self-healed without me having to do anything, so that's something [08:10:54] I got both [08:10:54] and now I got the problem one, reverse order [08:11:09] so 4 mins of delay ? 
[08:11:14] yep :( [08:11:28] it's probably on your carrier [08:11:30] I got them just fine [08:11:35] ack [08:11:39] yes, me too [08:11:40] I got both of them in the proper order... the LVS HTTP IPv4 on mathoid.... [08:11:53] akosiaris: what was the thing that should have happened one host at a time, out of curiosity? [08:11:55] some Italian issue [08:11:57] :P [08:12:17] just came back [08:12:22] ema: reimaging hosts I believe [08:12:31] (03PS1) 10Filippo Giunchedi: graphite: add graphite2003 [puppet] - 10https://gerrit.wikimedia.org/r/443785 (https://phabricator.wikimedia.org/T196483) [08:12:32] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [08:12:44] do you already know what the issue was? [08:12:56] ema: reimaging. So TL;DR is that I had reimaged yesterday kubernetes2003,4 successfully and in sequence (passing --sequential). Then today I became a bit more bold and did more hosts (kubernetes2001,2) in parallel [08:13:09] but from the reimaging of yesterday all the pods were scheduled on those 2 hosts [08:13:17] so I effectively killed all pods at the same time [08:13:22] chaos monkey ftw [08:13:27] is mathoid on kubernetes? 
[08:13:30] yup [08:13:37] ah, sorry, wasn't understanding [08:13:37] (03PS1) 10Elukey: role::eventlogging::analytics::zeromq: delete unused role [puppet] - 10https://gerrit.wikimedia.org/r/443787 [08:13:43] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet operation_type={container_status,create_container,image_status,list_containers,list_podsandbox,podsandbox_status,pull_image,run_podsandbox,start_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:13:44] I thought that was unrelated [08:14:18] (03CR) 10Elukey: [C: 032] role::eventlogging::analytics::zeromq: delete unused role [puppet] - 10https://gerrit.wikimedia.org/r/443787 (owner: 10Elukey) [08:14:21] again, the good thing is this recovered from a double hardware issue on its own without any action on my part in 4 mins [08:14:32] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet operation_type={create_container,image_status,list_containers,list_podsandbox,podsandbox_status,pull_image,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:14:33] so at least that's something [08:14:37] nice, yes :) [08:14:42] the latencies are expected [08:14:48] they got scheduled on different pods? [08:14:50] after this I am finally gonna raise those thresholds [08:14:52] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:14:52] mathoid should not have a lot of impact on end users unless it is down for a long time, should it? 
[08:14:58] volans: you mean nodes, but yes [08:15:02] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal [08:15:02] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal [08:15:03] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [08:15:04] yeah, sorry [08:15:10] jynus: overall ? yes that's correct [08:15:14] nice :) [08:15:24] it should only kick in on rendering, I guess? [08:15:32] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:15:34] but I don't know much about that level of the stack [08:15:35] yes and then the result is stored in restbase [08:15:44] so no big deal [08:15:55] yeah we probably lost a few renders [08:15:59] in fact I can calculate them [08:16:00] that is ok [08:16:18] I guess they also get a rerender on next display [08:16:20] and even then, mediawiki will retry IIRC [08:16:25] yeah [08:17:07] It would be nice to even get a render with "this part failed", but that may not be possible or easy (or it already happens) [08:17:07] so current rate is around 0.1 req/s [08:17:48] so ~24 initially failed render requests [08:17:53] https://grafana.wikimedia.org/dashboard/db/service-mathoid?panelId=8&fullscreen&orgId=1&from=now-15m&to=now [08:18:26] sorry I wasn't around, I was on a short break [08:19:31] !log Optimize recentchanges table on s6 codfw - this will generate lag on codfw s6 - T178290 [08:19:32] I'll do an incident response [08:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:34] T178290: Check recentchanges table and query errors on wikis other than commonswiki and ruwiki - https://phabricator.wikimedia.org/T178290 [08:19:57] marostegui: ok to go with my reimage plan? 
[08:20:37] I was going to do db2038, db2047 and db2052 [08:23:24] let me seee [08:23:39] jynus: yep, all good [08:23:46] (03PS2) 10Filippo Giunchedi: graphite: add graphite2003 [puppet] - 10https://gerrit.wikimedia.org/r/443785 (https://phabricator.wikimedia.org/T196483) [08:23:50] ok, starting [08:24:29] (03CR) 10Filippo Giunchedi: [C: 032] graphite: add graphite2003 [puppet] - 10https://gerrit.wikimedia.org/r/443785 (https://phabricator.wikimedia.org/T196483) (owner: 10Filippo Giunchedi) [08:29:09] PROBLEM - puppet last run on graphite2003 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 8 seconds ago with 5 failures. Failed resources (up to 3 shown) [08:29:38] PROBLEM - Check systemd state on graphite2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:30:26] (03PS2) 10Giuseppe Lavagetto: redis: remove cronjob for restarts on slaves as well [puppet] - 10https://gerrit.wikimedia.org/r/443772 (https://phabricator.wikimedia.org/T198220) [08:30:28] (03PS2) 10Giuseppe Lavagetto: redis: remove now-useless specific classes [puppet] - 10https://gerrit.wikimedia.org/r/443773 (https://phabricator.wikimedia.org/T198220) [08:30:30] (03PS1) 10Giuseppe Lavagetto: site.pp: move slave redises to system::spare [puppet] - 10https://gerrit.wikimedia.org/r/443789 (https://phabricator.wikimedia.org/T198220) [08:31:38] RECOVERY - Check systemd state on graphite2003 is OK: OK - running: The system is fully operational [08:31:55] (03PS3) 10Giuseppe Lavagetto: redis: remove cronjob for restarts on slaves as well [puppet] - 10https://gerrit.wikimedia.org/r/443772 (https://phabricator.wikimedia.org/T198220) [08:34:38] https://wikitech.wikimedia.org/wiki/Incident_documentation/20180704-mathoid [08:35:18] PROBLEM - MD RAID on kubernetes1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:35:18] PROBLEM - Check size of conntrack table on kubernetes2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[08:35:18] PROBLEM - configured eth on kubernetes2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:36:49] PROBLEM - Check systemd state on kubernetes2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:36:49] PROBLEM - dhclient process on kubernetes2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:37:00] I should manually trigger a rebalancing [08:38:28] PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:40:08] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:40:08] PROBLEM - configured eth on kubernetes1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:40:08] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:40:08] PROBLEM - puppet last run on kubernetes2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:40:19] RECOVERY - MD RAID on kubernetes1002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [08:40:20] (03CR) 10Giuseppe Lavagetto: [C: 032] redis: remove cronjob for restarts on slaves as well [puppet] - 10https://gerrit.wikimedia.org/r/443772 (https://phabricator.wikimedia.org/T198220) (owner: 10Giuseppe Lavagetto) [08:41:09] RECOVERY - Check size of conntrack table on kubernetes1002 is OK: OK: nf_conntrack is 0 % full [08:41:09] RECOVERY - configured eth on kubernetes1002 is OK: OK - interfaces up [08:41:44] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443790 [08:41:48] PROBLEM - DPKG on kubernetes2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[08:43:04] RECOVERY - DPKG on kubernetes2001 is OK: All packages OK [08:43:23] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2001 is OK: OK ferm input default policy is set [08:43:34] RECOVERY - Check size of conntrack table on kubernetes2001 is OK: OK: nf_conntrack is 0 % full [08:43:34] RECOVERY - configured eth on kubernetes2001 is OK: OK - interfaces up [08:43:44] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443790 (owner: 10Marostegui) [08:45:01] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443790 (owner: 10Marostegui) [08:47:20] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1078 after alter table (duration: 00m 50s) [08:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:05] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443790 (owner: 10Marostegui) [08:49:34] RECOVERY - dhclient process on kubernetes2001 is OK: PROCS OK: 0 processes with command name dhclient [08:49:49] (03PS1) 10Marostegui: db-eqiad.php: Depool db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443791 (https://phabricator.wikimedia.org/T191316) [08:51:07] (03PS1) 10Giuseppe Lavagetto: mw-maintenance: switch to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/443792 (https://phabricator.wikimedia.org/T192092) [08:51:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443791 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [08:51:30] <_joe_> akosiaris, apergos ^^ let's do it? [08:51:41] ahhh! [08:51:46] <_joe_> and send an announcement about turning off terbium on monday? 
[08:51:57] _joe_: +1 [08:52:38] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443791 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [08:52:53] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443791 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [08:53:08] !log update analytics-in4 filter rules on cr1/cr2 eqiad - T198623 [08:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:12] T198623: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 [08:53:50] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1123 for alter table (duration: 00m 50s) [08:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:03] !log Deploy schema change on db1123 T191316 T192926 T89737 T195193 [08:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:09] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [08:54:09] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [08:54:10] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [08:54:10] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [08:54:50] how do we move those cron jobs he was talking about? 
[08:56:21] !log uploaded linux 4.9.107~wmf1 for jessie-wikimedia to apt.wikimedia.org [08:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:11] (03CR) 10Gehel: [C: 032] wdqs: create log files during log rotation [puppet] - 10https://gerrit.wikimedia.org/r/443583 (owner: 10Gehel) [08:57:25] (03PS3) 10Gehel: wdqs: create log files during log rotation [puppet] - 10https://gerrit.wikimedia.org/r/443583 [08:58:18] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet operation_type={podsandbox_status,remove_container,stop_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:58:38] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet operation_type={podsandbox_status,remove_container,stop_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:00:02] !log manually rebalance the mathoid kubernetes production cluster namespaces pods wise [09:00:03] I see... he left some patchsets already prepped for us [09:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:18] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:00:22] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441381/ and then https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441346/ [09:00:47] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:00:55] yeah, _joe_ kind of merged them from what I gather [09:00:58] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type=container_status https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:01:13] <_joe_> yes [09:01:58] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:02:24] (03PS3) 10Giuseppe Lavagetto: redis: remove now-useless specific classes [puppet] - 10https://gerrit.wikimedia.org/r/443773 (https://phabricator.wikimedia.org/T198220) [09:02:38] RECOVERY - Check systemd state on kubernetes2001 is OK: OK - running: The system is fully operational [09:03:00] (03CR) 10Giuseppe Lavagetto: [C: 032] redis: remove now-useless specific classes [puppet] - 10https://gerrit.wikimedia.org/r/443773 (https://phabricator.wikimedia.org/T198220) (owner: 10Giuseppe Lavagetto) [09:03:47] !log Optimize recentchanges table on s2 codfw - this will generate lag on codfw s2 - T178290 [09:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:50] T178290: Check recentchanges table and query errors on wikis other than commonswiki and ruwiki - https://phabricator.wikimedia.org/T178290 [09:04:33] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2038 and db2047 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443766 (owner: 10Jcrespo) [09:04:38] (03PS1) 10Muehlenhoff: Bump meta package for new ABI in 4.9.107 [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/443794 [09:05:07] +1 from me [09:05:48] (03Merged) 10jenkins-bot: mariadb: Depool db2038 and db2047 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443766 (owner: 10Jcrespo) [09:07:38] _joe_: ^^ [09:07:56] (03CR) 10jenkins-bot: mariadb: Depool db2038 and db2047 for reimage 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/443766 (owner: 10Jcrespo) [09:08:37] RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes2001 is OK: OK: synced at Wed 2018-07-04 09:08:29 UTC. [09:08:47] (03CR) 10Muehlenhoff: [C: 032] Bump meta package for new ABI in 4.9.107 [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/443794 (owner: 10Muehlenhoff) [09:09:28] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2038, db2047 (duration: 00m 50s) [09:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:18] RECOVERY - puppet last run on kubernetes2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [09:12:22] volans: I am done with the reimages [09:12:53] akosiaris: thanks, I'm testing the code anyway with parallel files to not hurt anyone's work ;) [09:13:21] volans: so I guess you are doing deploy now? [09:13:53] jynus: no, still testing, and need to fine tune a couple of things for edge cases, feel free to proceed with your reimages [09:14:07] (03CR) 10Addshore: [C: 031] "Per https://www.mediawiki.org/wiki/Manual:$wgRightsPage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443531 (owner: 10Krinkle) [09:14:22] I will not merge before checking with you all ;) [09:15:25] (03PS2) 10Gehel: Use backports version of osm2pgsql on Stretch for improved memory handling [puppet] - 10https://gerrit.wikimedia.org/r/443668 (https://phabricator.wikimedia.org/T198485) (owner: 10Mholloway) [09:15:36] !log reimage aqs1009 to Debian Stretch [09:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:52] volans: will do 2 reimages then with whatever version we have now [09:16:04] ack [09:16:04] !log uploaded linux-meta 1.18 for jessie-wikimedia to apt.wikimedia.org [09:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:36] (03CR) 10Gehel: [C: 032] Use backports version of osm2pgsql on Stretch 
for improved memory handling [puppet] - 10https://gerrit.wikimedia.org/r/443668 (https://phabricator.wikimedia.org/T198485) (owner: 10Mholloway) [09:16:52] ACKNOWLEDGEMENT - Long running screen/tmux on lawrencium is CRITICAL: NRPE: Command check_check_long_procs not defined Jcrespo Host to be decomm. https://phabricator.wikimedia.org/T191360 [09:18:31] hashar: you around by any chance? [09:21:13] (03CR) 10Addshore: [C: 031] "+1 to the move and removing things from these Wikibase* files." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443532 (owner: 10Krinkle) [09:23:51] (03CR) 10Addshore: [C: 031] Remove wmgUseClusterFileBackend (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443533 (owner: 10Krinkle) [09:24:15] volans: yes I am :) [09:24:38] hashar: how can I see what a couple of CI jobs execute? [09:24:54] I'm interested in operations-dns-lint and operations-dns-tabs [09:25:16] apparently I don't have permissions to see the configuration of the jobs ;) [09:25:17] (03CR) 10Addshore: [C: 031] Remove wmgUseClusterFileBackend (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443534 (owner: 10Krinkle) [09:27:03] volans: hmm yeah maybe ops dont have access [09:27:16] (03PS1) 10Arturo Borrero Gonzalez: hieradata: profile::openstack::eqiad1::neutron::metadata_proxy_shared_secret [labs/private] - 10https://gerrit.wikimedia.org/r/443796 (https://phabricator.wikimedia.org/T196633) [09:27:38] (03CR) 10Addshore: [C: 031] Move filebackend.php include towards the top near other includes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443535 (owner: 10Krinkle) [09:27:42] volans: for operations-dns-lint that is https://github.com/wikimedia/integration-config/blob/master/jjb/operations-misc.yaml#L22-L26 [09:27:47] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] hieradata: profile::openstack::eqiad1::neutron::metadata_proxy_shared_secret [labs/private] - 10https://gerrit.wikimedia.org/r/443796 
(https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [09:28:00] which uses a script 'authdns-lint' shipped via puppet [09:28:04] ok [09:28:25] the -tabs job, it is a nasty find|xargs|grep something [09:28:41] (03PS1) 10Filippo Giunchedi: graphite: update graphite-auth for django 1.9 [puppet] - 10https://gerrit.wikimedia.org/r/443797 [09:28:43] (03PS1) 10Filippo Giunchedi: graphite: fix for stretch/django 1.9 compat [puppet] - 10https://gerrit.wikimedia.org/r/443798 (https://phabricator.wikimedia.org/T196483) [09:28:46] (03PS1) 10Alexandros Kosiaris: labsdb1006: Reimage as stretch and make it osm::master [puppet] - 10https://gerrit.wikimedia.org/r/443799 (https://phabricator.wikimedia.org/T197246) [09:28:54] volans: https://github.com/wikimedia/integration-config/blob/17bb03adb5276b70ad87aadbfe6f31143cbd50e2/jjb/job-templates.yaml#L234-L247 [09:28:58] ok, so we don't have any easy 'entry point' for adding stuff there [09:29:04] !log upload kubernetes 1.8.14 to apt.wikimedia.org/stretch-wikimedia/main [09:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:29] thanks for the pointers to the code :) [09:29:40] (03CR) 10jerkins-bot: [V: 04-1] graphite: fix for stretch/django 1.9 compat [puppet] - 10https://gerrit.wikimedia.org/r/443798 (https://phabricator.wikimedia.org/T196483) (owner: 10Filippo Giunchedi) [09:31:13] (03CR) 10Addshore: [C: 031] Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443529 (owner: 10Krinkle) [09:32:20] (03PS2) 10Giuseppe Lavagetto: mw-maintenance: switch to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/443792 (https://phabricator.wikimedia.org/T192092) [09:32:22] (03PS1) 10Giuseppe Lavagetto: terbium: Add a decommission notice. 
[puppet] - 10https://gerrit.wikimedia.org/r/443801 (https://phabricator.wikimedia.org/T192092) [09:33:04] <_joe_> ok apergos, akosiaris I'm going on with the first change, can you take a look at the decom notice? [09:34:52] (03CR) 10Addshore: [C: 031] Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (1/2) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443529 (owner: 10Krinkle) [09:35:18] it's wrong fwiw [09:35:21] (03PS2) 10Filippo Giunchedi: graphite: fix for stretch/django 1.9 compat [puppet] - 10https://gerrit.wikimedia.org/r/443798 (https://phabricator.wikimedia.org/T196483) [09:35:31] (03CR) 10Addshore: [C: 031] Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443530 (owner: 10Krinkle) [09:35:34] it's applied on the wrong host. The decom motd::script I mean [09:35:59] it will be? [09:36:04] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: add nova common profile [puppet] - 10https://gerrit.wikimedia.org/r/443802 (https://phabricator.wikimedia.org/T196633) [09:37:29] well it says This server will be decommissioned on July 9th; please use [09:37:30] mwmaint1001.eqiad.wmnet instead. 
[09:37:44] so I am assuming this means it is to be applied to terbium [09:37:51] I'll upload a change [09:39:51] yes, mwmaint1001 is the final one, but AFAIK, terbium is not yet 100% unused [09:39:53] (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: add nova common profile [puppet] - 10https://gerrit.wikimedia.org/r/443802 (https://phabricator.wikimedia.org/T196633) [09:39:57] (03PS3) 10Giuseppe Lavagetto: mw-maintenance: switch to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/443792 (https://phabricator.wikimedia.org/T192092) [09:40:12] <_joe_> akosiaris: yeah sigh [09:40:18] <_joe_> I added it in the wrong place :D [09:40:19] ok fixed [09:40:37] I'll send mail to ops, engineering [09:40:45] <_joe_> apergos: oh thanks [09:40:50] <_joe_> maybe wikitech-l too? [09:40:58] <_joe_> wmde people have access as well [09:41:07] (03PS4) 10Alexandros Kosiaris: mw-maintenance: switch to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/443792 (https://phabricator.wikimedia.org/T192092) (owner: 10Giuseppe Lavagetto) [09:41:09] <_joe_> or we just send an email to everyone who has access [09:41:09] (03PS2) 10Alexandros Kosiaris: terbium: Add a decommission notice. [puppet] - 10https://gerrit.wikimedia.org/r/443801 (https://phabricator.wikimedia.org/T192092) (owner: 10Giuseppe Lavagetto) [09:41:21] <_joe_> argh you overwrote my ps3 [09:41:24] (03CR) 10Ladsgroup: [C: 031] "I thought I did this. Thank you!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/443686 (https://phabricator.wikimedia.org/T198768) (owner: 10Sbisson) [09:42:05] volans: I am off, but potentially we could craft a job that clones operations/puppet for the authdnslint script, and uses an entry point at the root of operations/dns.git (eg using make or a shell script or whatever) [09:42:18] volans: I am off though, but we can talk about it tomorrow morning [09:42:39] hasharAway: sure, no hurry [09:43:18] (03PS5) 10Giuseppe Lavagetto: mw-maintenance: switch to mwmaint1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/443792 (https://phabricator.wikimedia.org/T192092) [09:43:20] (03CR) 10Muehlenhoff: [C: 031] terbium: Add a decommission notice. [puppet] - 10https://gerrit.wikimedia.org/r/443801 (https://phabricator.wikimedia.org/T192092) (owner: 10Giuseppe Lavagetto) [09:43:48] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] mw-maintenance: switch to mwmaint1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/443792 (https://phabricator.wikimedia.org/T192092) (owner: 10Giuseppe Lavagetto) [09:44:31] <_joe_> !log stopping all cronjobs via a puppet run on terbium, T192092 [09:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:34] T192092: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092 [09:46:57] (03PS3) 10Arturo Borrero Gonzalez: openstack: eqiad1: add nova common profile [puppet] - 10https://gerrit.wikimedia.org/r/443802 (https://phabricator.wikimedia.org/T196633) [09:47:15] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compiler is good:" [puppet] - 10https://gerrit.wikimedia.org/r/443802 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [09:47:35] (03PS1) 10Elukey: Revert "netboot.cfg: temp remove aqs hosts to allow manual work during d-i" [puppet] - 10https://gerrit.wikimedia.org/r/443805 [09:47:52] (03Abandoned) 10Elukey: Revert "netboot.cfg: temp remove aqs 
hosts to allow manual work during d-i" [puppet] - 10https://gerrit.wikimedia.org/r/443805 (owner: 10Elukey) [09:48:03] (03PS1) 10Elukey: Revert "netboot.cfg: temp remove aqs hosts to allow manual work during d-i" [puppet] - 10https://gerrit.wikimedia.org/r/443806 [09:48:10] (03PS2) 10Elukey: Revert "netboot.cfg: temp remove aqs hosts to allow manual work during d-i" [puppet] - 10https://gerrit.wikimedia.org/r/443806 [09:48:49] (03CR) 10Elukey: [C: 032] Revert "netboot.cfg: temp remove aqs hosts to allow manual work during d-i" [puppet] - 10https://gerrit.wikimedia.org/r/443806 (owner: 10Elukey) [09:51:39] I'd rather spam the three lists than try targetted email. draft is ready to go when the switch is complete and things look stable [09:52:17] (03PS3) 10Giuseppe Lavagetto: terbium: Add a decommission notice. [puppet] - 10https://gerrit.wikimedia.org/r/443801 (https://phabricator.wikimedia.org/T192092) [09:52:21] (03PS1) 10Jcrespo: mariadb: Allow reimage to stretch of db2047, db2038 [puppet] - 10https://gerrit.wikimedia.org/r/443807 [09:53:01] (03PS2) 10Filippo Giunchedi: graphite: update graphite-auth for django 1.9 [puppet] - 10https://gerrit.wikimedia.org/r/443797 [09:53:08] (03CR) 10Vgutierrez: [C: 032] Add IPv6 records for authdns2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/443580 (https://phabricator.wikimedia.org/T196664) (owner: 10Vgutierrez) [09:53:11] (03CR) 10Filippo Giunchedi: [C: 032] graphite: update graphite-auth for django 1.9 [puppet] - 10https://gerrit.wikimedia.org/r/443797 (owner: 10Filippo Giunchedi) [09:53:17] (03PS2) 10Vgutierrez: Add IPv6 records for authdns2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/443580 (https://phabricator.wikimedia.org/T196664) [09:53:20] (03PS2) 10Jcrespo: mariadb: Allow reimage to stretch of db2047, db2038 [puppet] - 10https://gerrit.wikimedia.org/r/443807 [09:54:13] (03PS3) 10Jcrespo: mariadb: Allow reimage to stretch of db2047, db2038 [puppet] - 10https://gerrit.wikimedia.org/r/443807 
[09:54:49] <_joe_> meh, if we only didn't use ff-only... [09:54:55] (03CR) 10Giuseppe Lavagetto: [C: 032] terbium: Add a decommission notice. [puppet] - 10https://gerrit.wikimedia.org/r/443801 (https://phabricator.wikimedia.org/T192092) (owner: 10Giuseppe Lavagetto) [09:54:56] :-) [09:55:09] <_joe_> I hate I lose ~ 30 mins/week to this [09:55:19] (03PS4) 10Giuseppe Lavagetto: terbium: Add a decommission notice. [puppet] - 10https://gerrit.wikimedia.org/r/443801 (https://phabricator.wikimedia.org/T192092) [09:55:23] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] terbium: Add a decommission notice. [puppet] - 10https://gerrit.wikimedia.org/r/443801 (https://phabricator.wikimedia.org/T192092) (owner: 10Giuseppe Lavagetto) [09:59:51] RECOVERY - puppet last run on graphite2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:01:20] !log rolling reboot of sca* for "lazy fpu" kernel updates [10:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:19] (03PS1) 10Giuseppe Lavagetto: site.pp: change priority of the motd, fix script [puppet] - 10https://gerrit.wikimedia.org/r/443808 [10:07:45] (03CR) 10Giuseppe Lavagetto: [C: 032] site.pp: change priority of the motd, fix script [puppet] - 10https://gerrit.wikimedia.org/r/443808 (owner: 10Giuseppe Lavagetto) [10:09:31] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [10:11:33] <_joe_> uh? ^^ [10:12:27] <_joe_> that's codfw, expected? 
[10:13:44] _joe_: it seems there's been a spike in traffic and then a drop to normal levels [10:13:54] <_joe_> yeah I was looking as well [10:14:05] <_joe_> interesting we only alert on the drop :D [10:14:05] same [10:14:33] I guess we won't alert on a spike unless it's big enough to put us out of business [10:14:59] (03PS3) 10Filippo Giunchedi: graphite: fix for stretch/django 1.9 compat [puppet] - 10https://gerrit.wikimedia.org/r/443798 (https://phabricator.wikimedia.org/T196483) [10:15:08] yeah the idea behind that alert is that we want to know when a DC gets a suspiciously low amount of traffic (because eg. router drama) [10:15:11] <_joe_> the drop in codfw is kinda strange though [10:15:19] <_joe_> ema: yeah I get the reasoning [10:15:21] 10Operations, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547 (10jcrespo) Sorry, you didn't understood what I meant- for ORES, it was: T159753 and for translation, T183485, both as sum... [10:16:16] the drop is text-only it seems [10:16:42] (03CR) 10Filippo Giunchedi: [C: 032] graphite: fix for stretch/django 1.9 compat [puppet] - 10https://gerrit.wikimedia.org/r/443798 (https://phabricator.wikimedia.org/T196483) (owner: 10Filippo Giunchedi) [10:16:58] however, in line with last week: https://grafana.wikimedia.org/dashboard/db/varnish-caching-last-week-comparison?refresh=15m&orgId=1&var-cluster=text&var-site=codfw&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5 [10:17:25] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443809 [10:21:29] (03PS1) 10Filippo Giunchedi: graphite: fix require_package [puppet] - 10https://gerrit.wikimedia.org/r/443810 (https://phabricator.wikimedia.org/T196483) [10:22:11] PROBLEM - puppet last run on graphite2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:22:28] known ^ [10:22:36] (03CR) 10Filippo Giunchedi: [C: 032] graphite: fix require_package [puppet] - 10https://gerrit.wikimedia.org/r/443810 (https://phabricator.wikimedia.org/T196483) (owner: 10Filippo Giunchedi) [10:22:43] (03PS2) 10Filippo Giunchedi: graphite: fix require_package [puppet] - 10https://gerrit.wikimedia.org/r/443810 (https://phabricator.wikimedia.org/T196483) [10:23:40] 10Operations, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547 (10awight) @jcrespo I see, well in this case content storage is exactly what we're planning to use. Is there anything sp... [10:23:42] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:24:11] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443809 (owner: 10Marostegui) [10:25:28] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443809 (owner: 10Marostegui) [10:26:13] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: add nova conductor profile [puppet] - 10https://gerrit.wikimedia.org/r/443811 (https://phabricator.wikimedia.org/T196633) [10:26:48] (03CR) 10jerkins-bot: [V: 04-1] openstack: eqiad1: add nova conductor profile [puppet] - 10https://gerrit.wikimedia.org/r/443811 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [10:27:03] 10Operations, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547 (10jcrespo) Ok, that is much better, but I guess it still would double the revision table (or the 5 new tables that are to... 
[10:27:12] RECOVERY - puppet last run on graphite2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:27:46] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443809 (owner: 10Marostegui) [10:28:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1123 after alter table (duration: 00m 50s) [10:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:12] PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:29:31] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:33:13] _joe_: do we want to manually edit the www-data crontab on terbium so no new jobs will start? [10:33:38] <_joe_> apergos: those left are just silver leftovers [10:33:43] <_joe_> they don't do anything [10:33:50] (03CR) 10Jcrespo: [C: 032] mariadb: Allow reimage to stretch of db2047, db2038 [puppet] - 10https://gerrit.wikimedia.org/r/443807 (owner: 10Jcrespo) [10:33:52] <_joe_> but yeah, be my guest :) [10:33:53] godog: modules/profile/manifests/labs/monitoring.pp declared mod-uwsgi with a plain package, so puppet trips over the duplicate declaration [10:33:58] (03PS4) 10Jcrespo: mariadb: Allow reimage to stretch of db2047, db2038 [puppet] - 10https://gerrit.wikimedia.org/r/443807 [10:34:26] moritzm: sigh, thanks [10:35:30] (03PS1) 10Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443815 (https://phabricator.wikimedia.org/T191316) [10:35:41] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. 
[10:37:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443815 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [10:38:15] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443815 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [10:38:30] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443815 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [10:38:36] (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: add nova conductor profile [puppet] - 10https://gerrit.wikimedia.org/r/443811 (https://phabricator.wikimedia.org/T196633) [10:39:01] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:39:13] (03CR) 10jerkins-bot: [V: 04-1] openstack: eqiad1: add nova conductor profile [puppet] - 10https://gerrit.wikimedia.org/r/443811 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [10:39:59] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1077 for alter table (duration: 00m 50s) [10:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:21] !log Stop replication on db1077 to drop triggers on db1124:3313 - T192926 [10:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:25] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [10:43:16] (03PS1) 10Filippo Giunchedi: labs: move libapache2-mod-uwsgi to graphite::web [puppet] - 10https://gerrit.wikimedia.org/r/443816 (https://phabricator.wikimedia.org/T196483) [10:44:22] moritzm arturo ^ [10:44:28] looking [10:44:42] godog: ? 
[10:45:08] (I ignore wikibugs, if that's what you are pointing to) [10:45:36] arturo: ah yeah, I was pointing to https://gerrit.wikimedia.org/r/c/operations/puppet/+/443816 [10:46:09] FWIW I have traffic bot show up as NOTICEs not as PRIVMSGs as it should be [10:46:28] helps with telling the two apart [10:46:34] !log Deploy schema change on db1077 with replication, this will generate lag on labs s3 T191316 T192926 T89737 T195193 [10:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:40] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [10:46:41] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [10:46:41] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [10:48:20] (03CR) 10Muehlenhoff: [C: 031] "Looks good (ideally the remaining package declarations were also switched to require_package, which would have prevented that error also)," [puppet] - 10https://gerrit.wikimedia.org/r/443816 (https://phabricator.wikimedia.org/T196483) (owner: 10Filippo Giunchedi) [10:48:32] (03CR) 10Arturo Borrero Gonzalez: [C: 031] "I would say run this with puppet catalog compiler to make sure it wont break due to the package being required elsewhere (ordering, whatev" [puppet] - 10https://gerrit.wikimedia.org/r/443816 (https://phabricator.wikimedia.org/T196483) (owner: 10Filippo Giunchedi) [10:49:58] (03PS3) 10Arturo Borrero Gonzalez: openstack: eqiad1: add nova conductor profile [puppet] - 10https://gerrit.wikimedia.org/r/443811 (https://phabricator.wikimedia.org/T196633) [10:53:54] !log stop db2038 and db2047 [10:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:56] (03PS4) 10Arturo Borrero Gonzalez: openstack: eqiad1: add nova conductor profile [puppet] - 10https://gerrit.wikimedia.org/r/443811 (https://phabricator.wikimedia.org/T196633) 
[10:58:45] !log update compiler facts [10:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:31] godog: labmon1001.eqiad.wmnet instead of .wikimedia.org ? [11:00:17] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compiler is good to go:" [puppet] - 10https://gerrit.wikimedia.org/r/443811 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [11:00:55] arturo: yeah that was it, then I realized I need new facts for graphite2003 anyway [11:01:06] great [11:01:30] (03CR) 10Filippo Giunchedi: [C: 032] "PCC https://puppet-compiler.wmflabs.org/compiler02/11663/" [puppet] - 10https://gerrit.wikimedia.org/r/443816 (https://phabricator.wikimedia.org/T196483) (owner: 10Filippo Giunchedi) [11:01:33] thanks BTW godog :-) [11:01:39] (03PS2) 10Filippo Giunchedi: labs: move libapache2-mod-uwsgi to graphite::web [puppet] - 10https://gerrit.wikimedia.org/r/443816 (https://phabricator.wikimedia.org/T196483) [11:01:48] hehe no problem arturo ! simple enough fix [11:05:53] mhh puppet-facts-export fails with KeyError: 'trusted' [11:05:59] will take a look after lunch [11:07:53] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:08:03] ACKNOWLEDGEMENT - MD RAID on cp3048 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T198784 [11:08:42] PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:09:28] oh ^^^^ these puppet errors seem to be my fault [11:09:37] !log cp3043: mdadm /dev/md0 --fail /dev/sdb1 [11:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:21] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:13:33] ACKNOWLEDGEMENT - MD RAID on cp3043 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T198785 [11:14:13] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:15:17] ACKNOWLEDGEMENT - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues cpettet Ok [11:17:18] !log cp3043: mdadm /dev/md0 --add /dev/sdc1 (sdc is former cp3048:sdb) [11:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:22] ACKNOWLEDGEMENT - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues cpettet Ok [11:18:37] (03PS1) 10Arturo Borrero Gonzalez: openstack: pass version hiera key down to nova conductor service class [puppet] - 10https://gerrit.wikimedia.org/r/443817 [11:22:23] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compiler seems OK:" [puppet] - 10https://gerrit.wikimedia.org/r/443817 (owner: 10Arturo Borrero Gonzalez) [11:27:33] jouncebot: next [11:27:33] In 25 hour(s) and 32 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180705T1300) [11:27:53] !log resuming rolling restart of cassandra on restbase hosts in eqiad completed [11:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:58] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443818 [11:28:02] !log rolling restart of cassandra on restbase hosts in eqiad completed [11:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:07] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:29:15] RECOVERY - puppet last run on labtestcontrol2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:29:27] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:29:27] RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:29:28] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: add nova scheduler profile [puppet] - 10https://gerrit.wikimedia.org/r/443819 (https://phabricator.wikimedia.org/T196633) [11:35:12] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: eqiad1: add nova scheduler profile [puppet] - 10https://gerrit.wikimedia.org/r/443819 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo 
Borrero Gonzalez) [11:35:23] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compiler is OK: https://puppet-compiler.wmflabs.org/compiler02/11668/" [puppet] - 10https://gerrit.wikimedia.org/r/443819 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [11:38:45] PROBLEM - Host cp3034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:39:35] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 132 not-conn: cp3034_v4, cp3034_v6, cp3048_v4, cp3048_v6 [11:40:06] PROBLEM - Host cp3048.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:40:06] ACKNOWLEDGEMENT - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 132 not-conn: cp3034_v4, cp3034_v6, cp3048_v4, cp3048_v6 Ema hw maintenance [11:40:40] ACKNOWLEDGEMENT - Host cp3048.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ema hw maintenance [11:40:40] ACKNOWLEDGEMENT - Host cp3034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ema hw maintenance [11:44:05] RECOVERY - Host cp3034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.32 ms [11:52:28] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443818 (owner: 10Marostegui) [11:53:44] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443818 (owner: 10Marostegui) [11:54:54] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1077 after alter table (duration: 00m 52s) [11:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:25] RECOVERY - Host cp3048.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.33 ms [11:58:02] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443818 (owner: 10Marostegui) [12:04:08] !log installing ruby 1.9 security updates on trusty [12:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:11] (03CR) 
10Aklapper: "Tim Starling: See T198570" [puppet] - 10https://gerrit.wikimedia.org/r/441525 (https://phabricator.wikimedia.org/T197922) (owner: 1020after4) [12:19:39] 10Operations, 10ops-esams: Relabel hooft to bast3002 - https://phabricator.wikimedia.org/T198790 (10mark) [12:25:55] PROBLEM - Host snapshot1005 is DOWN: PING CRITICAL - Packet loss = 100% [12:26:01] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: enable nova-api [puppet] - 10https://gerrit.wikimedia.org/r/443824 (https://phabricator.wikimedia.org/T196633) [12:27:27] I don't see any log about restarting snapshot1005 [12:27:53] apergos or arturo ^ something you may be aware of? [12:28:07] no [12:28:19] so either crash or network loss [12:28:25] checking [12:28:50] (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: enable nova-api [puppet] - 10https://gerrit.wikimedia.org/r/443824 (https://phabricator.wikimedia.org/T196633) [12:29:02] no idea [12:31:01] "The server is not powered on" [12:31:12] Server Power: Off [12:31:16] that's nice [12:31:40] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compiler is good for:" [puppet] - 10https://gerrit.wikimedia.org/r/443824 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [12:31:59] jynus: I am on the mgmt interface [12:32:08] going to power it back up, ok? 
[12:32:09] !log shutting down bast3002 for disk replacement [12:32:10] I will let you handle it [12:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:13] cool [12:32:14] ok for me [12:32:18] (use a different bastion for a few minutes) [12:33:11] (03PS1) 10Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443825 (https://phabricator.wikimedia.org/T197069) [12:36:17] nice [12:36:25] doesn't want to power back on [12:36:37] I would check the lifecycle log [12:36:47] maybe a power failure was detected [12:37:08] or something else replaceable [12:40:37] ACKNOWLEDGEMENT - MD RAID on bast3002 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T198791 [12:41:36] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443825 (https://phabricator.wikimedia.org/T197069) (owner: 10Marostegui) [12:42:29] 10Operations, 10ops-esams: Degraded RAID on bast3002 - https://phabricator.wikimedia.org/T198791 (10Volans) [12:42:31] 10Operations, 10ops-esams: Degraded RAID on bast3002 - https://phabricator.wikimedia.org/T183814 (10Volans) [12:42:53] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443825 (https://phabricator.wikimedia.org/T197069) (owner: 10Marostegui) [12:43:08] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443825 (https://phabricator.wikimedia.org/T197069) (owner: 10Marostegui) [12:46:23] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1089 for maintenance - T197069 (duration: 02m 57s) [12:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:26] T197069: Failover db1052 (s1) db primary master - https://phabricator.wikimedia.org/T197069 [12:51:56] (03PS1) 
10Marostegui: db1089.yaml: Change binlog format [puppet] - 10https://gerrit.wikimedia.org/r/443826 (https://phabricator.wikimedia.org/T197069) [12:53:13] (03CR) 10Marostegui: [C: 032] db1089.yaml: Change binlog format [puppet] - 10https://gerrit.wikimedia.org/r/443826 (https://phabricator.wikimedia.org/T197069) (owner: 10Marostegui) [12:56:18] !log Stop MySQL and reboot db1089 to upgrade+change it to statement - T197069 [12:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:22] T197069: Failover db1052 (s1) db primary master - https://phabricator.wikimedia.org/T197069 [13:03:27] (03PS1) 10Ema: cache_misc: decommission cp3009 [puppet] - 10https://gerrit.wikimedia.org/r/443827 (https://phabricator.wikimedia.org/T148422) [13:11:09] (03PS1) 10ArielGlenn: switch en wiki dumps to run on snapshot1009 for now [puppet] - 10https://gerrit.wikimedia.org/r/443828 (https://phabricator.wikimedia.org/T198792) [13:11:43] (03CR) 10ArielGlenn: [C: 04-1] "do not merge, this is just ready in case we need it" [puppet] - 10https://gerrit.wikimedia.org/r/443828 (https://phabricator.wikimedia.org/T198792) (owner: 10ArielGlenn) [13:25:08] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443829 [13:26:01] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2038 and db2047 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443830 [13:28:07] (03CR) 10Muehlenhoff: "a" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443361 (https://phabricator.wikimedia.org/T198526) (owner: 10Volans) [13:28:23] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443829 (owner: 10Marostegui) [13:28:32] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2038 and db2047 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443830 (owner: 10Jcrespo) [13:28:41] moritzm: I think I need some additional context :-P ^^^^ 
('a') [13:28:49] jynus: you go first :) [13:28:49] 10Operations, 10ops-esams: Move cp3030+ from OE14 to OE13 in racktables - https://phabricator.wikimedia.org/T136403 (10mark) 05Open>03Resolved This has now been corrected and verified on-site. [13:29:19] (03CR) 10Muehlenhoff: [C: 031] Refactor client authentication (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443361 (https://phabricator.wikimedia.org/T198526) (owner: 10Volans) [13:29:45] volans: UI fail, now properly added as a comment :-) [13:29:46] lol, thx [13:30:01] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443829 (owner: 10Marostegui) [13:30:04] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2038 and db2047 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443830 (owner: 10Jcrespo) [13:30:17] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443829 (owner: 10Marostegui) [13:33:22] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet operation_type={create_container,podsandbox_status,remove_container,run_podsandbox,start_container,stop_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:34:00] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2038, db2047 (duration: 02m 56s) [13:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:17] I know which ssh: connect to host snapshot1005.eqiad.wmnet [13:34:23] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:34:44] as you will be deploying next, marostegui, can you coordinate with apergos to depool it? [13:34:48] from scap [13:35:05] ah I have that in my patchset.... 
[13:35:15] I will wait for you to depool it then :) [13:35:16] when is the next deploy scheduled? [13:35:17] now? [13:35:18] so heads up, ^ marostegui [13:35:19] before my deploy [13:35:42] oh yeah, wmf-config [13:35:43] sigh [13:35:45] yeah :) [13:35:52] all right I'll just push it through now nbd [13:36:23] the others went good and no errors [13:36:37] plust codfw should not create much issues worse case scenario [13:36:44] I guess it's not conftool-driven snapshot1005 [13:37:13] apergos: let me know when done and I will deploy and confirm if I see errors again [13:37:13] RECOVERY - Device not healthy -SMART- on bast3002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=bast3002&var-datasource=esams%2520prometheus%252Fops [13:37:42] (03CR) 10ArielGlenn: [C: 032] switch en wiki dumps to run on snapshot1009 for now [puppet] - 10https://gerrit.wikimedia.org/r/443828 (https://phabricator.wikimedia.org/T198792) (owner: 10ArielGlenn) [13:37:54] see you [13:38:14] jynus: o/ [13:40:07] (03PS14) 10Gehel: Prep work for multi-instance elasticsearch refactor [puppet] - 10https://gerrit.wikimedia.org/r/440498 (owner: 10EBernhardson) [13:40:09] (03PS17) 10Gehel: convert role::logstash::elasticsearch to profiles [puppet] - 10https://gerrit.wikimedia.org/r/441894 (owner: 10EBernhardson) [13:40:11] (03PS21) 10Gehel: prometheus/elasticsearch support multiple exporters per host [puppet] - 10https://gerrit.wikimedia.org/r/441321 (owner: 10EBernhardson) [13:40:13] (03PS24) 10Gehel: Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 (owner: 10EBernhardson) [13:40:15] (03PS52) 10Gehel: Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [13:40:30] marostegui: that should do it, live on deploy1001 [13:40:32] cmjohnson1: hey, I was checking ms-be1036, host is fine but I can't reach its ipmi via ssh [13:40:43] 
apergos: cool, let me see [13:40:47] I think chris is out the rest of this week, go dog [13:41:08] ah! thanks apergos [13:41:18] (03CR) 10jerkins-bot: [V: 04-1] prometheus/elasticsearch support multiple exporters per host [puppet] - 10https://gerrit.wikimedia.org/r/441321 (owner: 10EBernhardson) [13:41:37] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1089 for after maintenance (duration: 00m 50s) [13:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:48] apergos: worked fine, no timeouts on that host! [13:41:49] thanks [13:42:00] volans: no these are not conftool configured, we don't want to depool unless someone has done so deliberately [13:42:11] it's a tiny cluster, we need almost every box [13:42:31] maroste gui: great! [13:44:38] ack, thanks for the info ;) [13:47:53] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443834 [13:47:57] Hey guys, a blackout on itwiki/eswiki, and the other wikis is fine and all, but is it really appropriate to be redirecting the API to that blackout page. I can safely say bots were never designed to handle redirects from the API. [13:48:03] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:48:25] IABot is throwing a tantrum right now. [13:49:13] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:51:18] Cyberpower678: I'd probably file a task for that [13:51:22] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443834 (owner: 10Marostegui) [13:51:32] Reedy: which project? 
[13:52:23] Honestly not sure [13:52:38] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443834 (owner: 10Marostegui) [13:52:38] * Cyberpower678 makes a high priority task. [13:52:43] (03CR) 10Gehel: "puppet compiler looks reasonable: https://puppet-compiler.wmflabs.org/compiler02/11671/" [puppet] - 10https://gerrit.wikimedia.org/r/440498 (owner: 10EBernhardson) [13:54:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1089 for after maintenance (duration: 00m 50s) [13:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:33] Reedy: nevermind my bad. The tantrum is a different problem. The api.php root page is only redirecting. All the actions are still functional. [13:56:54] Ah, ok :) [13:57:05] * Cyberpower678 just checks to make sure. [13:58:23] * Cyberpower678 accesses the run logs of IABot [14:08:46] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/429225 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [14:10:03] (03PS5) 10Filippo Giunchedi: Deprecate Diamond tcpconnstate and nfconntrackcount [puppet] - 10https://gerrit.wikimedia.org/r/429225 (https://phabricator.wikimedia.org/T183454) [14:10:28] !log installing file/libmagic security updates on trusty (Debian already fixed) [14:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:45] 10Operations, 10Cassandra, 10Services (watching), 10User-Eevans: Cassandra uses default ip address for outbound packets while bootstrapping - https://phabricator.wikimedia.org/T128590 (10fgiunchedi) >>! In T128590#4396369, @Eevans wrote: > @fgiunchedi, is this still a thing? Good question, I think we'll h... 
[14:12:08] (03CR) 10Filippo Giunchedi: [C: 032] Deprecate Diamond tcpconnstate and nfconntrackcount [puppet] - 10https://gerrit.wikimedia.org/r/429225 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [14:17:20] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443835 [14:18:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443835 (owner: 10Marostegui) [14:20:11] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443835 (owner: 10Marostegui) [14:22:35] (03PS4) 10Muehlenhoff: Allow removing Diamond gradually [puppet] - 10https://gerrit.wikimedia.org/r/429389 (https://phabricator.wikimedia.org/T183454) [14:22:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1089 for after maintenance (duration: 00m 50s) [14:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:07] (03PS1) 10Alexandros Kosiaris: kubernetes: Switch to all in firewalling policy [puppet] - 10https://gerrit.wikimedia.org/r/443836 [14:24:59] (03PS2) 10Alexandros Kosiaris: kubernetes::staging: Switch to all in firewall policy [puppet] - 10https://gerrit.wikimedia.org/r/443836 [14:25:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes::staging: Switch to all in firewall policy [puppet] - 10https://gerrit.wikimedia.org/r/443836 (owner: 10Alexandros Kosiaris) [14:25:44] !log upgrade kubernetes staging API server to 1.8.14 [14:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:31] (03PS1) 10Andrew Bogott: labsaliaser: suppress stdout in the cronjob [puppet] - 10https://gerrit.wikimedia.org/r/443837 [14:29:07] (03PS2) 10Andrew Bogott: labsaliaser: suppress stdout in the cronjob [puppet] - 10https://gerrit.wikimedia.org/r/443837 [14:30:12] (03CR) 
10Andrew Bogott: [C: 032] labsaliaser: suppress stdout in the cronjob [puppet] - 10https://gerrit.wikimedia.org/r/443837 (owner: 10Andrew Bogott) [14:31:02] (03PS1) 10Alexandros Kosiaris: Adjust kubelet latencies thresholds [puppet] - 10https://gerrit.wikimedia.org/r/443838 [14:33:20] (03CR) 10Alexandros Kosiaris: [C: 032] Adjust kubelet latencies thresholds [puppet] - 10https://gerrit.wikimedia.org/r/443838 (owner: 10Alexandros Kosiaris) [14:36:44] !log installing perl security updates on trusty (Debian already fixed) [14:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:21] PROBLEM - kubelet operational latencies on kubestage1001 is CRITICAL: instance=kubestage1001.eqiad.wmnet operation_type={podsandbox_status,remove_container,stop_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:39:30] RECOVERY - kubelet operational latencies on kubestage1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:52:19] !log installing libipc-run-perl updates from jessie point release [14:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:35] (03CR) 10Filippo Giunchedi: [C: 032] "I've changed the graphite udp dashboards to use Prometheus for UDP instead of Graphite" [puppet] - 10https://gerrit.wikimedia.org/r/442865 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [14:55:42] (03PS2) 10Filippo Giunchedi: statsite: deprecate Diamond udp collector [puppet] - 10https://gerrit.wikimedia.org/r/442865 (https://phabricator.wikimedia.org/T183454) [15:00:24] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:02:24] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:12:20] 10Operations, 10ops-esams, 10Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063 (10mark) [15:19:51] (03CR) 10Alexandros Kosiaris: [C: 031] Allow to delete hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443362 (https://phabricator.wikimedia.org/T198526) (owner: 10Volans) [15:20:28] 10Operations, 10Cassandra, 10Services (watching), 10User-Eevans: Cassandra uses default ip address for outbound packets while bootstrapping - https://phabricator.wikimedia.org/T128590 (10mobrovac) 05Open>03stalled [15:27:46] (03CR) 10Alexandros Kosiaris: [C: 031] Allow to link hosts to external resources [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443363 (https://phabricator.wikimedia.org/T198590) (owner: 10Volans) [15:28:41] (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM, but no DataTables expert here." [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443364 (https://phabricator.wikimedia.org/T198591) (owner: 10Volans) [15:29:35] (03PS2) 10Giuseppe Lavagetto: mediawiki: add vhost define [puppet] - 10https://gerrit.wikimedia.org/r/439893 (https://phabricator.wikimedia.org/T196968) [15:29:37] (03PS1) 10Giuseppe Lavagetto: mediawiki::web: move mediawiki.org, test.wikidata to individual files [puppet] - 10https://gerrit.wikimedia.org/r/443842 (https://phabricator.wikimedia.org/T196968) [15:29:39] (03PS1) 10Giuseppe Lavagetto: mediawiki::web: convert mediawiki.org to mediawiki::vhost [puppet] - 10https://gerrit.wikimedia.org/r/443843 (https://phabricator.wikimedia.org/T196968) [15:29:41] (03CR) 10Alexandros Kosiaris: [C: 031] DataTables: cleanup initialization [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443365 (https://phabricator.wikimedia.org/T198591) (owner: 10Volans) [15:31:16] (03CR) 10Alexandros Kosiaris: [C: 031] DataTables: refactor column grouping [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443367 
(https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [15:33:51] (03PS1) 10Ema: Decommission ex-cache_maps: cp300[3-6] [puppet] - 10https://gerrit.wikimedia.org/r/443845 (https://phabricator.wikimedia.org/T167376) [15:34:40] (03PS2) 10Giuseppe Lavagetto: mediawiki::web: move mediawiki.org, test.wikidata to individual files [puppet] - 10https://gerrit.wikimedia.org/r/443842 (https://phabricator.wikimedia.org/T196968) [15:37:46] (03PS1) 10Ema: Remove DNS entries for esams ex-cache_maps: cp300[3-6] [dns] - 10https://gerrit.wikimedia.org/r/443846 (https://phabricator.wikimedia.org/T167376) [15:38:18] (03PS2) 10Ema: Decommission esams ex-cache_maps: cp300[3-6] [puppet] - 10https://gerrit.wikimedia.org/r/443845 (https://phabricator.wikimedia.org/T167376) [15:44:04] (03CR) 10Ema: [C: 032] Decommission esams ex-cache_maps: cp300[3-6] [puppet] - 10https://gerrit.wikimedia.org/r/443845 (https://phabricator.wikimedia.org/T167376) (owner: 10Ema) [15:45:20] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443847 [15:46:35] (03PS1) 10Mobrovac: c-for-each: Increase retry and delay defaults [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/443848 (https://phabricator.wikimedia.org/T198787) [15:47:18] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443847 (owner: 10Marostegui) [15:47:21] (03PS2) 10Mobrovac: c-foreach-restart: Increase retry and delay defaults [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/443848 (https://phabricator.wikimedia.org/T198787) [15:48:35] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443847 (owner: 10Marostegui) [15:50:15] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1089 for after maintenance (duration: 00m 51s) [15:50:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:40] !log Optimize dewiki.logging on s5 codfw master with replication, this will generate lag on s5 codfw - T197459 [15:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:44] T197459: Optimize logging table - https://phabricator.wikimedia.org/T197459 [15:57:02] (03PS2) 10Ema: Remove production DNS entries for esams ex-cache_maps [dns] - 10https://gerrit.wikimedia.org/r/443846 (https://phabricator.wikimedia.org/T167376) [15:58:07] (03PS3) 10Ema: Remove production DNS entries for esams ex-cache_maps [dns] - 10https://gerrit.wikimedia.org/r/443846 (https://phabricator.wikimedia.org/T167376) [15:58:54] (03CR) 10Ema: [C: 032] Remove production DNS entries for esams ex-cache_maps [dns] - 10https://gerrit.wikimedia.org/r/443846 (https://phabricator.wikimedia.org/T167376) (owner: 10Ema) [16:01:49] (03PS1) 10Ema: Remove mgmt DNS entries for esams ex-cache_maps [dns] - 10https://gerrit.wikimedia.org/r/443851 (https://phabricator.wikimedia.org/T167376) [16:05:39] PROBLEM - Host cp3008 is DOWN: PING CRITICAL - Packet loss = 100% [16:06:15] marostegui: cp3008 down? ^ [16:06:24] sorry, I mean mark [16:07:08] RECOVERY - Host cp3008 is UP: PING OK - Packet loss = 0%, RTA = 84.08 ms [16:23:36] no alert for lvs300x management? 
:) [16:24:16] mark: only lvs3001.mgmt [16:25:02] actually just recovered [16:31:33] sure [16:31:37] i rerouted those cables [16:44:12] (03CR) 10Thiemo Kreuz (WMDE): "wgRightsUrl is actually used as a default for the "dataRightsUrl" setting WikibaseRepo uses for the little JS popup messages that appear w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443531 (owner: 10Krinkle) [16:45:36] (03PS4) 10Filippo Giunchedi: WIP grafana: host overview dashboard as code [puppet] - 10https://gerrit.wikimedia.org/r/442301 (https://phabricator.wikimedia.org/T171482) [16:46:07] (03CR) 10jerkins-bot: [V: 04-1] WIP grafana: host overview dashboard as code [puppet] - 10https://gerrit.wikimedia.org/r/442301 (https://phabricator.wikimedia.org/T171482) (owner: 10Filippo Giunchedi) [16:56:56] (03PS7) 10Volans: wmf-auto-reimage: validate certificate fingerprint [puppet] - 10https://gerrit.wikimedia.org/r/433928 [16:56:58] (03PS5) 10Volans: wmf-auto-reimage: improve donwtime of reimaged host [puppet] - 10https://gerrit.wikimedia.org/r/434894 (https://phabricator.wikimedia.org/T195423) [16:57:00] (03PS3) 10Volans: wmf-auto-reimage: use absolute path for subprocess [puppet] - 10https://gerrit.wikimedia.org/r/434896 [16:57:02] (03PS2) 10Volans: wmf-auto-reimage: fix parse argument bug [puppet] - 10https://gerrit.wikimedia.org/r/443670 [16:57:04] (03PS2) 10Volans: wmf-auto-reimage: use warning log level [puppet] - 10https://gerrit.wikimedia.org/r/443671 [16:58:54] (03CR) 10Volans: "@vgutierrez: since last review I've fixed two bugs (installer=True and --color=false) and improved the logging" [puppet] - 10https://gerrit.wikimedia.org/r/433928 (owner: 10Volans) [17:01:12] (03CR) 10Volans: "The latest changes are:" [puppet] - 10https://gerrit.wikimedia.org/r/434894 (https://phabricator.wikimedia.org/T195423) (owner: 10Volans) [17:08:06] sorry for the spam ;) [17:17:22] (03PS2) 10Volans: Refactor client authentication [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443361 
(https://phabricator.wikimedia.org/T198526) [17:17:24] (03PS2) 10Volans: Allow to delete hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443362 (https://phabricator.wikimedia.org/T198526) [17:17:26] (03PS2) 10Volans: Allow to link hosts to external resources [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443363 (https://phabricator.wikimedia.org/T198590) [17:17:28] (03PS2) 10Volans: DataTables: save state for the session [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443364 (https://phabricator.wikimedia.org/T198591) [17:17:30] (03PS2) 10Volans: DataTables: cleanup initialization [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443365 (https://phabricator.wikimedia.org/T198591) [17:17:32] (03PS2) 10Volans: Kernels refactor [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443366 (https://phabricator.wikimedia.org/T198592) [17:17:34] (03PS2) 10Volans: DataTables: refactor column grouping [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443367 (https://phabricator.wikimedia.org/T198592) [17:17:36] (03PS2) 10Volans: Add search capability [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443368 (https://phabricator.wikimedia.org/T198592) [17:17:48] (03CR) 10Volans: Refactor client authentication (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443361 (https://phabricator.wikimedia.org/T198526) (owner: 10Volans) [17:18:11] (03CR) 10Volans: DataTables: refactor column grouping (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443367 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [17:18:26] (03CR) 10jerkins-bot: [V: 04-1] Refactor client authentication [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443361 (https://phabricator.wikimedia.org/T198526) (owner: 10Volans) [17:18:41] (03CR) 10jerkins-bot: [V: 04-1] Allow to delete hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443362 (https://phabricator.wikimedia.org/T198526) 
(owner: 10Volans) [17:18:46] (03CR) 10jerkins-bot: [V: 04-1] Allow to link hosts to external resources [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443363 (https://phabricator.wikimedia.org/T198590) (owner: 10Volans) [17:18:59] (03CR) 10jerkins-bot: [V: 04-1] DataTables: save state for the session [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443364 (https://phabricator.wikimedia.org/T198591) (owner: 10Volans) [17:19:19] (03CR) 10jerkins-bot: [V: 04-1] DataTables: cleanup initialization [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443365 (https://phabricator.wikimedia.org/T198591) (owner: 10Volans) [17:19:21] (03CR) 10jerkins-bot: [V: 04-1] Add search capability [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443368 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [17:19:23] (03CR) 10jerkins-bot: [V: 04-1] Kernels refactor [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443366 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [17:19:31] (03CR) 10jerkins-bot: [V: 04-1] DataTables: refactor column grouping [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443367 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [17:31:24] _joe_, Krinkle: so, I'd like to swat https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/440469/ tomorrow FYI [17:59:59] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:04:19] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:04:39] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:05:48] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:36:18] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:40:48] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:47:28] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:59:29] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:04:59] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:07:09] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:31:19] RECOVERY - MD RAID on bast3002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [20:15:01] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2038 and db2047 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443830 (owner: 10Jcrespo) [20:15:03] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443834 (owner: 10Marostegui) [20:15:05] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443835 (owner: 10Marostegui) [20:15:07] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443847 (owner: 10Marostegui) [20:48:49] (03PS3) 10Krinkle: Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443529 [20:48:59] (03PS3) 10Krinkle: Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (2/2) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/443530 [21:11:11] (03CR) 10Krinkle: "Indeed, on the cluster it is set directly in Wikibase.php:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443531 (owner: 10Krinkle) [21:11:17] (03PS2) 10Krinkle: Remove wgRightsUrl from Wikibase config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443531 [21:15:47] (03CR) 10Krinkle: Move wgRightsPage/wgRightsText from Wikibase.php to InitialiseSettings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443532 (owner: 10Krinkle) [21:15:49] (03PS2) 10Krinkle: Move wgRightsPage/wgRightsText from Wikibase.php to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443532 [21:22:28] (03CR) 10Krinkle: [C: 032] Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443529 (owner: 10Krinkle) [21:22:40] * Krinkle staging on deploy1001 and mwdebug1002 [21:23:48] (03Merged) 10jenkins-bot: Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443529 (owner: 10Krinkle) [21:24:52] (03CR) 10Reedy: "Those defaults probably want bumping again :D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443529 (owner: 10Krinkle) [21:27:02] Reedy: Aye, bumping them can be scary though in terms of impact. I quite like them to be far in the past :P [21:27:14] I used to do 6 months at a time [21:27:25] Just to confirm though, do you foresee an issue with moving them? [21:27:32] Move them to yesterday, sure [21:27:40] Move them to a year ago... Not so much [21:27:42] I mean moving the code. 
[21:27:48] oh : [21:27:49] :P [21:27:58] (03CR) 10jenkins-bot: Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443529 (owner: 10Krinkle) [21:28:03] Should be fine [21:30:20] (03CR) 10Krinkle: [C: 032] Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443530 (owner: 10Krinkle) [21:30:30] I guess it was in CommonSettings because no overrides at one point [21:30:53] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Ie30aeecbe (duration: 00m 52s) [21:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:35] (03Merged) 10jenkins-bot: Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443530 (owner: 10Krinkle) [21:32:29] (03CR) 10jenkins-bot: Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443530 (owner: 10Krinkle) [21:33:22] (03CR) 10Krinkle: [C: 032] Remove wgRightsUrl from Wikibase config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443531 (owner: 10Krinkle) [21:34:07] !log krinkle@deploy1001 Synchronized wmf-config/: I1e78ba4365 (duration: 00m 52s) [21:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:40] (03Merged) 10jenkins-bot: Remove wgRightsUrl from Wikibase config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443531 (owner: 10Krinkle) [21:35:49] (03PS3) 10Krinkle: Move wgRightsPage/wgRightsText from Wikibase.php to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443532 [21:36:22] (03CR) 10Krinkle: [C: 032] Move wgRightsPage/wgRightsText from Wikibase.php to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443532 (owner: 10Krinkle) [21:36:28] !log krinkle@deploy1001 Synchronized wmf-config/: 
If7da2a26bbf (duration: 00m 51s) [21:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:56] (03CR) 10jenkins-bot: Remove wgRightsUrl from Wikibase config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443531 (owner: 10Krinkle) [21:37:38] (03Merged) 10jenkins-bot: Move wgRightsPage/wgRightsText from Wikibase.php to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443532 (owner: 10Krinkle) [21:41:47] (03CR) 10jenkins-bot: Move wgRightsPage/wgRightsText from Wikibase.php to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443532 (owner: 10Krinkle) [21:42:36] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Ia8deabc5d2625 (duration: 00m 50s) [21:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:38] !log krinkle@deploy1001 Synchronized wmf-config/: Ia8deabc5d2625 (duration: 00m 51s) [21:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:45] (03PS3) 10Krinkle: Remove wmgUseClusterFileBackend (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443533 [21:44:53] (03CR) 10Krinkle: [C: 032] Remove wmgUseClusterFileBackend (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443533 (owner: 10Krinkle) [21:46:32] (03Merged) 10jenkins-bot: Remove wmgUseClusterFileBackend (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443533 (owner: 10Krinkle) [21:47:03] (03PS3) 10Krinkle: Remove wmgUseClusterFileBackend (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443534 [21:49:16] (03CR) 10Krinkle: [C: 032] Remove wmgUseClusterFileBackend (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443534 (owner: 10Krinkle) [21:49:55] !log krinkle@deploy1001 Synchronized wmf-config/: I6f8cfa8f (duration: 00m 51s) [21:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:28] (03Merged) 10jenkins-bot: Remove 
wmgUseClusterFileBackend (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443534 (owner: 10Krinkle) [21:50:52] (03CR) 10jenkins-bot: Remove wmgUseClusterFileBackend (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443533 (owner: 10Krinkle) [21:52:52] (03PS2) 10Krinkle: Move filebackend.php include towards the top near other includes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443535 [21:53:35] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Iab176f3d205 (duration: 00m 50s) [21:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:22] (03CR) 10Krinkle: [C: 032] Move filebackend.php include towards the top near other includes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443535 (owner: 10Krinkle) [21:55:37] (03Merged) 10jenkins-bot: Move filebackend.php include towards the top near other includes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443535 (owner: 10Krinkle) [22:06:00] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: I5a6e28814ba6b7 (duration: 00m 51s) [22:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:34] (03PS2) 10Krinkle: Remove duplicate phpunit entry from composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443108 (owner: 10Reedy) [22:17:37] (03CR) 10Krinkle: [C: 032] Remove duplicate phpunit entry from composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443108 (owner: 10Reedy) [22:18:58] (03Merged) 10jenkins-bot: Remove duplicate phpunit entry from composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443108 (owner: 10Reedy) [22:23:58] (03PS3) 10Krinkle: Move if onto newline in FeaturedFeedsWMF.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439438 (owner: 10Reedy) [22:24:05] Reedy: Mind if I land that as-is? 
[22:24:31] Yeah [22:24:45] That file is a messss [22:25:25] (03CR) 10Krinkle: [C: 04-1] "Could you explain the problem and solution in more detail? It's not exactly clear to me in what way this was a duplicate." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440092 (https://phabricator.wikimedia.org/T197095) (owner: 10Urbanecm) [22:25:35] (03CR) 10Krinkle: [C: 032] Move if onto newline in FeaturedFeedsWMF.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439438 (owner: 10Reedy) [22:25:54] Well, it's happy hour and you've got two patches for the price of 0. [22:26:02] Get 'em while you can :P [22:26:52] (03Merged) 10jenkins-bot: Move if onto newline in FeaturedFeedsWMF.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439438 (owner: 10Reedy) [22:28:44] !log krinkle@deploy1001 Synchronized wmf-config/FeaturedFeedsWMF.php: I004bc9c3e71 (duration: 00m 50s) [22:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:08] (03PS1) 10Reedy: Bump default cache epochs from 20130601 to 20160101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443866 [22:30:21] :P [22:31:32] Or do you want 2015? ;) [22:31:37] Reedy: So the reason I think it's high-impact (potentially) isn't due to purging of old cache values (that'd be good, and would only be a minority that old, we should be fine there), but due to purging of current values that use it as a hashing source. [22:31:41] At least RL uses it that way [22:31:55] which means bumping it or changing it in any way is the same as clearing all caches of all wikis. [22:32:58] It's a bit of a nuclear approach. I'd actually support removing its use from several code paths one by one. [22:43:49] Hm.. it's quite possible ThumbnailEpoch doesn't affect us anymore given that we use the 404 handler. [22:43:52] And also, Thumbor. [22:44:22] It's only used if RENDER_NOW is passed, which happens for UploadStash and ThumbnailJob, which are for recently uploaded files only. 
[22:49:47] (03CR) 10Krinkle: [C: 04-1] "Would recommend adding this to the multiversion/ directory instead so that PSR-4 and namespaces within wmf-config can be evaluated and dec" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331825 (owner: 10Dereckson) [22:51:58] (03CR) 10Urbanecm: "It was suggested by @MarcoAurelio in 440002. He says as all bureaucrats are sysops, it makes no sense to assign sysop-level privileges to " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440092 (https://phabricator.wikimedia.org/T197095) (owner: 10Urbanecm) [22:52:28] (03CR) 10Urbanecm: "(to explain myself: sysops are allowed to grant IPBE everywhere)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440092 (https://phabricator.wikimedia.org/T197095) (owner: 10Urbanecm) [23:23:46] (03PS1) 10Krinkle: Improve file-level documentation for various wmf-config files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443870 [23:27:29] (03PS2) 10Krinkle: Improve file-level documentation for various wmf-config files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443870