[00:02:24] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:07:55] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:24:35] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:25:44] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:26:48] <wikibugs>	 (PS2) Krinkle: Change logo for eswiki temporarily [mediawiki-config] - https://gerrit.wikimedia.org/r/443742 (https://phabricator.wikimedia.org/T198761) (owner: Urbanecm)
[00:27:04] <wikibugs>	 (CR) Krinkle: [C: 2] "ImageOptim was able to compress it slightly more via AdvPNG." [mediawiki-config] - https://gerrit.wikimedia.org/r/443742 (https://phabricator.wikimedia.org/T198761) (owner: Urbanecm)
[00:27:21] * Krinkle stating on deploy1001 and mwdebug1002
[00:27:25] <Krinkle>	 staging*
[00:27:47] <Krinkle>	 Urbanecm: stand by for verification :)
[00:28:24] <wikibugs>	 (Merged) jenkins-bot: Change logo for eswiki temporarily [mediawiki-config] - https://gerrit.wikimedia.org/r/443742 (https://phabricator.wikimedia.org/T198761) (owner: Urbanecm)
[00:28:37] <wikibugs>	 (CR) jenkins-bot: Change logo for eswiki temporarily [mediawiki-config] - https://gerrit.wikimedia.org/r/443742 (https://phabricator.wikimedia.org/T198761) (owner: Urbanecm)
[00:30:16] <Krinkle>	 Urbanecm: live on mwdebug1002 now
[00:33:39] <Krinkle>	 abian: Urbanecm: Please verify and confirm that this should deploy now. Is there another change you're trying to sync it with on-wiki?
[00:34:08] <Krinkle>	 Ah, it seems it was already changed on-wiki using common.css. Without @2, though, so it's serving low-res images at the moment.
[00:34:42] <abian>	 Yes, that was a temporary solution
[00:34:49] <Krinkle>	 And makes it hard to verify for you, but if you're comfortable with the DevTools in your browser, you can simulate it by turning those rules off. Let me know, I can also try to verify it for you instead.
[00:36:22] <logmsgbot>	 !log krinkle@deploy1001 Synchronized static/images/project-logos/: T198761 - Update eswiki logo (duration: 00m 51s)
[00:36:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:26] <stashbot>	 T198761: Replace logo of the Spanish Wikipedia, which joins the protests against the European copyright directive proposal - https://phabricator.wikimedia.org/T198761
[00:36:29] <abian>	 I guess these changes will take some hours to propagate (I don't know the infrastructure, but that's what happened when I updated the Wikidata logo)
[00:36:44] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:37:14] <Krinkle>	 abian: https://i.imgur.com/fxg0149.png
[00:37:48] <Krinkle>	 that's how I confirmed that it worked via the mwdebug1002 server
[00:37:51] <Krinkle>	 It's now deployed.
[00:37:56] <Krinkle>	 I'll also purge the cache.
[00:37:58] <Krinkle>	 !log Purge https://en.wikipedia.org/static/images/project-logos/eswiki.png
[00:38:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:02] <Krinkle>	 !log Purge https://en.wikipedia.org/static/images/project-logos/eswiki-1.5x.png
[00:38:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:05] <Krinkle>	 !log Purge https://en.wikipedia.org/static/images/project-logos/eswiki-2x.png
[00:38:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:27] <abian>	 Yeah, apparently fixed now :)
[00:38:38] <Krinkle>	 abian: For people that have browsed es.wikipedia recently, their browser will still remember the old file for a long time indeed. But for any new visitors it will use the new copy. It is updated in our caches, but we do not control the browser cache :)
[00:38:54] <abian>	 Sure :)
[00:39:00] <abian>	 Thanks, Krinkle!
[00:39:04] <Krinkle>	 yw :)
[00:39:24] <Krinkle>	 Remember to remove the common.css override to avoid browsers downloading the logo twice :)
[00:40:15] <abian>	 {{done}}
[00:42:14] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:46:07] <wikibugs>	 Operations, Release-Engineering-Team, Scap, Performance-Team (Radar): mwscript emits warning "grep: GREP_OPTIONS is deprecated; please use an alias or script" - https://phabricator.wikimedia.org/T198775 (Krinkle)
[00:47:45] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:54:24] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:19:49] <Lith>	 is it just me, or does anyone else still see https://meta.wikimedia.org/wiki/Special:UsersWhoWillBeRenamed?
[01:23:54] <icinga-wm>	 PROBLEM - Memory correctable errors -EDAC- on scb1002 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=scb1002&var-datasource=eqiad%2520prometheus%252Fops
[01:24:46] <AntiComposite>	 I see the header for an empty table.
[01:26:04] <Lith>	 huh
[01:26:11] <Lith>	 that shouldn't be there
[01:27:09] <Lith>	 https://phabricator.wikimedia.org/T118637 should have removed it, as $wgCentralAuthEnableUsersWhoWillBeRenamed should default to false
[01:28:24] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:29:25] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:32:01] <Lith>	 it doesn't look to be set via mw-config either
[01:39:24] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:40:25] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:42:10] <Lith>	 o wait, nvm, looks like it didn't get deployed
[01:53:53] <wikibugs>	 (PS1) Krinkle: webperf: Rename role::xenon to profile::webperf::xenon [puppet] - https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312)
[01:54:47] <wikibugs>	 (CR) jenkins-bot: [V: -1] webperf: Rename role::xenon to profile::webperf::xenon [puppet] - https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312) (owner: Krinkle)
[01:55:49] <wikibugs>	 (PS2) Krinkle: webperf: Rename role::xenon to profile::webperf::xenon [puppet] - https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312)
[01:58:04] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:04:34] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:35:36] <logmsgbot>	 !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.10) (duration: 14m 57s)
[02:35:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:37:44] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:43:15] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:45:51] <logmsgbot>	 !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Wed Jul  4 02:45:51 UTC 2018 (duration 10m 16s)
[02:45:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:54:24] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:59:54] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:05:15] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:06:24] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:17:25] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:18:34] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:24:04] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:29:35] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:40:34] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:58:24] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:59:57] <wikibugs>	 (PS3) Krinkle: webperf: Rename role::xenon to profile::webperf::xenon [puppet] - https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312)
[04:09:25] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:10:26] <wikibugs>	 (CR) Krinkle: "No-op in puppet compiler for prod, and applies cleanly to beta cluster." [puppet] - https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312) (owner: Krinkle)
[04:10:34] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:10:44] <wikibugs>	 (CR) Krinkle: "https://puppet-compiler.wmflabs.org/compiler02/11659/" [puppet] - https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312) (owner: Krinkle)
[04:17:50] <wikibugs>	 (PS1) Krinkle: profiler: Enable xenon collection in labs (same as prod) [mediawiki-config] - https://gerrit.wikimedia.org/r/443760 (https://phabricator.wikimedia.org/T195312)
[04:19:16] <wikibugs>	 (CR) Krinkle: [C: 2] profiler: Enable xenon collection in labs (same as prod) [mediawiki-config] - https://gerrit.wikimedia.org/r/443760 (https://phabricator.wikimedia.org/T195312) (owner: Krinkle)
[04:20:36] <wikibugs>	 (Merged) jenkins-bot: profiler: Enable xenon collection in labs (same as prod) [mediawiki-config] - https://gerrit.wikimedia.org/r/443760 (https://phabricator.wikimedia.org/T195312) (owner: Krinkle)
[04:20:41] <wikibugs>	 (CR) jenkins-bot: profiler: Enable xenon collection in labs (same as prod) [mediawiki-config] - https://gerrit.wikimedia.org/r/443760 (https://phabricator.wikimedia.org/T195312) (owner: Krinkle)
[04:20:55] <wikibugs>	 (PS2) Krinkle: Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443529
[04:21:02] <wikibugs>	 (PS2) Krinkle: Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443530
[04:27:46] <wikibugs>	 (CR) Tim Starling: [C: 2] Rewrite sql script to use the new mysql.php wrapper [puppet] - https://gerrit.wikimedia.org/r/441153 (owner: Tim Starling)
[04:28:11] <wikibugs>	 (PS4) Tim Starling: Rewrite sql script to use the new mysql.php wrapper [puppet] - https://gerrit.wikimedia.org/r/441153
[04:29:54] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 430.55 seconds
[04:32:35] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:33:14] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[04:33:53] <wikibugs>	 (PS1) Krinkle: mediawiki: Change xenon interval for Beta Cluster from 10min to 30s [puppet] - https://gerrit.wikimedia.org/r/443762
[04:35:16] <wikibugs>	 (PS2) Krinkle: mediawiki: Change xenon interval for Beta Cluster from 10min to 30s [puppet] - https://gerrit.wikimedia.org/r/443762
[04:38:05] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:43:35] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:46:34] <wikibugs>	 (PS3) Krinkle: mediawiki: Change xenon interval for Beta Cluster from 10min to 30s [puppet] - https://gerrit.wikimedia.org/r/443762 (https://phabricator.wikimedia.org/T195312)
[04:49:14] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:50:31] <wikibugs>	 Operations, ops-eqiad, DBA, Patch-For-Review: Physically move es1017 from D to C row - https://phabricator.wikimedia.org/T197072 (Marostegui) Anything left after repooling the host?
[04:53:42] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/443763 (https://phabricator.wikimedia.org/T191316)
[04:54:18] <wikibugs>	 Operations, ops-eqiad, DBA, Patch-For-Review: Physically move es1017 from D to C row - https://phabricator.wikimedia.org/T197072 (jcrespo) Open>Resolved a:Cmjohnson I don't think so.
[04:55:51] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/443763 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[04:57:05] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/443763 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[04:57:26] <wikibugs>	 (PS5) Krinkle: webperf: Rename webperf profiles for clarity [puppet] - https://gerrit.wikimedia.org/r/443752 (https://phabricator.wikimedia.org/T195314)
[04:57:28] <wikibugs>	 (PS4) Krinkle: webperf: Rename role::xenon to profile::webperf::xenon [puppet] - https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312)
[04:57:30] <wikibugs>	 (PS4) Krinkle: mediawiki: Change xenon interval for Beta Cluster from 10min to 30s [puppet] - https://gerrit.wikimedia.org/r/443762
[04:57:32] <wikibugs>	 (PS1) Krinkle: webperf: Enable xenondata_host on perfsite in Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/443764
[04:57:42] <wikibugs>	 (PS2) Krinkle: webperf: Enable xenondata_host on perfsite in Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/443764 (https://phabricator.wikimedia.org/T195312)
[04:57:44] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/443763 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[04:59:00] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1101:3317 for alter table (duration: 00m 52s)
[04:59:01] <marostegui>	 !log Deploy schema change on db1101:3317 T191316 T192926 T89737 T195193 
[04:59:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:10] <stashbot>	 T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[04:59:11] <stashbot>	 T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[04:59:11] <stashbot>	 T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[04:59:11] <stashbot>	 T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[05:03:04] <wikibugs>	 (PS1) Jcrespo: mariadb: Depool db2038 and db2047 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/443766
[05:04:08] <wikibugs>	 (CR) Marostegui: [C: 1] mariadb: Depool db2038 and db2047 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/443766 (owner: Jcrespo)
[05:04:26] <wikibugs>	 (PS1) Krinkle: deploymen-prep: Remove mediawiki06 from dsh groups [puppet] - https://gerrit.wikimedia.org/r/443767 (https://phabricator.wikimedia.org/T192996)
[05:04:38] <wikibugs>	 (PS2) Krinkle: deployment-prep: Remove mediawiki06 from dsh groups [puppet] - https://gerrit.wikimedia.org/r/443767 (https://phabricator.wikimedia.org/T192996)
[05:28:07] <marostegui>	 !log Optimize recentchanges table on s3 codfw - this will generate lag on codfw s3 - T178290
[05:28:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:11] <stashbot>	 T178290: Check recentchanges table and query errors on wikis other than commonswiki and ruwiki - https://phabricator.wikimedia.org/T178290
[05:34:08] <marostegui>	 !log Optimize recentchanges table on s3 eqiad, host by host T178290
[05:34:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:12] <stashbot>	 T178290: Check recentchanges table and query errors on wikis other than commonswiki and ruwiki - https://phabricator.wikimedia.org/T178290
[05:34:59] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:36:08] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:41:09] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:52:59] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:58:28] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:59:38] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:00:58] <icinga-wm>	 PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[06:03:36] <marostegui>	 !log Optimize recentchanges table on s7 codfw - this will generate lag on codfw s7 - T178290
[06:03:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:40] <stashbot>	 T178290: Check recentchanges table and query errors on wikis other than commonswiki and ruwiki - https://phabricator.wikimedia.org/T178290
[06:06:04] <wikibugs>	 (PS2) Giuseppe Lavagetto: service::node: Expose the MW appservers' host to modules [puppet] - https://gerrit.wikimedia.org/r/443444 (https://phabricator.wikimedia.org/T198461) (owner: Mobrovac)
[06:06:14] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: 2 C: 2] service::node: Expose the MW appservers' host to modules [puppet] - https://gerrit.wikimedia.org/r/443444 (https://phabricator.wikimedia.org/T198461) (owner: Mobrovac)
[06:07:28] <icinga-wm>	 RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[06:14:26] <wikibugs>	 (CR) Elukey: [C: 1] wmf-auto-reimage: use warning log level [puppet] - https://gerrit.wikimedia.org/r/443671 (owner: Volans)
[06:15:37] <elukey>	 !log reimage aqs1008 to Debian Stretch
[06:15:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:06] <wikibugs>	 (CR) Tim Starling: Fix phabricator rate limiting (1 comment) [puppet] - https://gerrit.wikimedia.org/r/441525 (https://phabricator.wikimedia.org/T197922) (owner: 20after4)
[06:21:39] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:23:09] <icinga-wm>	 PROBLEM - tileratorui on maps-test2004 is CRITICAL: connect to address 10.192.16.35 and port 6535: Connection refused
[06:23:39] <icinga-wm>	 PROBLEM - tilerator on maps-test2004 is CRITICAL: connect to address 10.192.16.35 and port 6534: Connection refused
[06:27:09] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:29:04] <icinga-wm>	 PROBLEM - puppet last run on phab1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh]
[06:29:07] <jynus>	 Anyone working with phabricator?
[06:29:14] <jynus>	 puppet error there, very recent
[06:30:23] <icinga-wm>	 PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/Lets_Encrypt_Authority_X3.crt]
[06:32:03] <icinga-wm>	 PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml]
[06:32:29] <elukey>	 jynus: do you mean on phab1002?
[06:32:40] <jynus>	 yes
[06:32:57] <elukey>	 I think it is still wip, Daniel was working on it before leaving for holidays
[06:33:06] <jynus>	 but the error is recent
[06:33:14] <jynus>	 either something else changed
[06:33:23] <jynus>	 or something I cannot think of
[06:33:24] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:33:53] <elukey>	 yeah you are right, weird
[06:34:39] <elukey>	 jynus: was temporary, just re-run puppet and it went fine
[06:35:08] <jynus>	 then maybe puppetmaster temporary issues
[06:35:53] <icinga-wm>	 RECOVERY - tileratorui on maps-test2004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.091 second response time
[06:37:31] <icinga-wm>	 RECOVERY - puppet last run on phab1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:43:57] <_joe_>	 !log restarting tilerator on maps-test2004, to check if it can recover
[06:43:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:41] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:53:44] <_joe_>	 I think that's the puppetmaster logrotating
[06:53:51] <_joe_>	 the puppet failures above
[06:54:01] <_joe_>	 I've seen them repeatedly around 6:30
[06:55:20] <icinga-wm>	 PROBLEM - Host kubernetes2003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:55:50] <icinga-wm>	 RECOVERY - Host kubernetes2003 is UP: PING WARNING - Packet loss = 86%, RTA = 471.89 ms
[06:56:01] <icinga-wm>	 RECOVERY - tilerator on maps-test2004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.075 second response time
[06:57:10] <icinga-wm>	 RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:42] <icinga-wm>	 PROBLEM - puppet last run on kubernetes2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[docker-registry.discovery.wmnet/calico/node]
[07:00:33] <icinga-wm>	 RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:07:42] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:07:45] <wikibugs>	 (PS1) Giuseppe Lavagetto: redis: remove cronjob for restarts on slaves as well [puppet] - https://gerrit.wikimedia.org/r/443772 (https://phabricator.wikimedia.org/T191316)
[07:07:47] <wikibugs>	 (PS1) Giuseppe Lavagetto: redis: remove now-useless specific classes [puppet] - https://gerrit.wikimedia.org/r/443773 (https://phabricator.wikimedia.org/T191316)
[07:09:32] <elukey>	 \o/
[07:09:43] <icinga-wm>	 RECOVERY - puppet last run on kubernetes2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:11:12] <_joe_>	 elukey: I'm not done :P
[07:14:02] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:18:57] <akosiaris>	 !log reimage kubernetes100{1,2}.eqiad.wmnet kubernetes200{1,2}.codfw.wmnet without swap
[07:18:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:38] <moritzm>	 !log installing imagemagick security updates
[07:22:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:55] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type={container_status,create_container,list_containers,list_podsandbox,podsandbox_status,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:30:55] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:33:47] <akosiaris>	 that's  ^ expected
[07:33:49] <wikibugs>	 (CR) Filippo Giunchedi: [C: 1] "LGTM, let's coordinate on when to merge this" [puppet] - https://gerrit.wikimedia.org/r/443114 (https://phabricator.wikimedia.org/T191659) (owner: Eevans)
[07:36:02] <wikibugs>	 (CR) Hashar: [C: 2] ":]" [mediawiki-config] - https://gerrit.wikimedia.org/r/443475 (https://phabricator.wikimedia.org/T195675) (owner: C. Scott Ananian)
[07:36:11] <hashar>	 ^^ labs / beta only
[07:36:36] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:37:23] <wikibugs>	 (Merged) jenkins-bot: Give a name to en-rtl wiki in Special:SiteMatrix in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/443475 (https://phabricator.wikimedia.org/T195675) (owner: C. Scott Ananian)
[07:37:36] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:38:06] <hashar>	 !log rebased /srv/mediawiki-staging on deploy1001 for beta cluster only change https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/443475/  
[07:38:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:11] <wikibugs>	 (CR) jenkins-bot: Give a name to en-rtl wiki in Special:SiteMatrix in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/443475 (https://phabricator.wikimedia.org/T195675) (owner: C. Scott Ananian)
[07:42:14] <wikibugs>	 (PS4) Ema: cache::text: ship cache_misc VCL [puppet] - https://gerrit.wikimedia.org/r/440157 (https://phabricator.wikimedia.org/T164609)
[07:42:57] <wikibugs>	 (CR) Ema: [C: 2] cache::text: ship cache_misc VCL [puppet] - https://gerrit.wikimedia.org/r/440157 (https://phabricator.wikimedia.org/T164609) (owner: Ema)
[07:43:05] <moritzm>	 !log resuming rolling restart of cassandra on restbase hosts in eqiad to pick up OpenJDK security update
[07:43:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:05] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/443780
[07:46:22] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:48:00] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/443780 (owner: Marostegui)
[07:49:13] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/443780 (owner: Marostegui)
[07:49:22] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:49:29] <wikibugs>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/443780 (owner: Marostegui)
[07:50:57] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1101:3317 after alter table (duration: 00m 53s)
[07:50:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:39] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 1] Use backports version of osm2pgsql on Stretch for improved memory handling [puppet] - https://gerrit.wikimedia.org/r/443668 (https://phabricator.wikimedia.org/T198485) (owner: Mholloway)
[07:53:39] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/443782 (https://phabricator.wikimedia.org/T191316)
[07:53:58] <volans>	 !log reimaging silver (spare host, to-be-decomm'ed) as testing host for the reimage script
[07:54:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:47] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/443782 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[07:57:47] <wikibugs>	 Operations, Analytics, Analytics-Kanban, netops: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (Gehel) For WDQS, we should keep access to at least 2 nodes in both eqiad and codfw. I propose:  wdqs1003: 10.64.0.14 wdqs1004: 10.64.0.17 wdqs2001: 10.192.32.1...
[07:58:02] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/443782 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[07:58:31] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/443782 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[07:59:26] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1078 for alter table (duration: 00m 50s)
[07:59:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:16] <marostegui>	 !log Deploy schema change on db1078 T191316 T192926 T89737 T195193 
[08:00:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:21] <stashbot>	 T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[08:00:22] <stashbot>	 T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[08:00:22] <stashbot>	 T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[08:00:23] <stashbot>	 T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[08:05:34] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on mathoid.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.20 and port 10042: Connection refused
[08:05:42] <akosiaris>	 uh oh
[08:05:51] <akosiaris>	 I might have been too eager here, looking
[08:06:13] <godog>	 mathoid threw its toys out of the pram?
[08:06:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - mathoid_10042: Servers kubernetes2002.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled
[08:06:26] <akosiaris>	 ah ok explainable, my mistake
[08:06:40] <akosiaris>	 I should have kept the pace at 1 host at a time
[08:06:50] <apergos>	 ah ha
[08:06:52] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - mathoid_10042: Servers kubernetes2004.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled
[08:06:56] <akosiaris>	 should fix itself really quick
[08:07:03] <vgutierrez>	 ok...
[08:08:20] <akosiaris>	 kubernetes is already creating the containers on kubernetes2003, kubernetes2004
[08:09:47] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on mathoid.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 925 bytes in 0.076 second response time
[08:09:55] <moritzm>	 !log rebooting multatuli for kernel update to 4.9.107~wmf1
[08:09:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:02] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kubernetes2004.codfw.wmnet, kubernetes2003.codfw.wmnet])
[08:10:02] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kubernetes2002.codfw.wmnet, kubernetes2003.codfw.wmnet])
[08:10:03] <wikibugs>	 Operations, Analytics, Analytics-Kanban, netops: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (elukey) First batch of changes:  ``` delete firewall family inet filter analytics-in4 term logstash delete firewall family inet filter analytics-in4 term event...
[08:10:07] <akosiaris>	 ok fixed. 
[08:10:22] <volans>	 I got only the recovery page so far
[08:10:27] <akosiaris>	 4 mins of outage. But it self-healed without me having to do anything, so that's something
[08:10:54] <akosiaris>	 I got both
[08:10:54] <volans>	 and now I got the problem one, reverse order
[08:11:09] <akosiaris>	 so 4 mins of delay ?
[08:11:14] <volans>	 yep :(
[08:11:28] <akosiaris>	 it's probably on your carrier
[08:11:30] <akosiaris>	 I got them just fine
[08:11:35] <volans>	 ack
[08:11:39] <ema>	 yes, me too
[08:11:40] <vgutierrez>	 I got both of them in the proper order... the LVS HTTP IPv4 on mathoid....
[08:11:53] <ema>	 akosiaris: what was the thing that should have happened one host at a time, out of curiosity?
[08:11:55] <vgutierrez>	 some italian issue
[08:11:57] <vgutierrez>	 :P
[08:12:17] <jynus>	 just came back
[08:12:22] <vgutierrez>	 ema: reimaging hosts I believe
[08:12:31] <wikibugs>	 (PS1) Filippo Giunchedi: graphite: add graphite2003 [puppet] - https://gerrit.wikimedia.org/r/443785 (https://phabricator.wikimedia.org/T196483)
[08:12:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy
[08:12:44] <jynus>	 do you already know what was the issue?
[08:12:56] <akosiaris>	 ema: reimaging. So TL;DR is that I had reimaged yesterday kubernetes2003,4 successfully and in sequence (passing --sequential). Then today I became a bit more bold and did more hosts (kubernetes2001,2) in parallel
[08:13:09] <akosiaris>	 but from the reimaging of yesterday all the pods were scheduled on those 2 hosts
[08:13:17] <akosiaris>	 so I effectively killed all pods at the same time
[08:13:22] <akosiaris>	 chaos monkey ftw
[08:13:27] <jynus>	 is mathoid on kubernetes?
[08:13:30] <akosiaris>	 yup
[08:13:37] <jynus>	 ah, sorry, wasn't understanding
[08:13:37] <wikibugs>	 (PS1) Elukey: role::eventlogging::analytics::zeromq: delete unused role [puppet] - https://gerrit.wikimedia.org/r/443787
[08:13:43] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet operation_type={container_status,create_container,image_status,list_containers,list_podsandbox,podsandbox_status,pull_image,run_podsandbox,start_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:13:44] <jynus>	 I thought that was unrelated
[08:14:18] <wikibugs>	 (CR) Elukey: [C: 2] role::eventlogging::analytics::zeromq: delete unused role [puppet] - https://gerrit.wikimedia.org/r/443787 (owner: Elukey)
[08:14:21] <akosiaris>	 again, the good thing is this recovered from a double hardware issue on its own, without any action on my part, in 4 mins
[08:14:32] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet operation_type={create_container,image_status,list_containers,list_podsandbox,podsandbox_status,pull_image,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:14:33] <akosiaris>	 so at least that's something
[08:14:37] <ema>	 nice, yes :)
[08:14:42] <akosiaris>	 the latencies are expected 
[08:14:48] <volans>	 they got scheduled on different pods?
[08:14:50] <akosiaris>	 after this thing I am finally gonna raise those thresholds
[08:14:52] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:14:52] <jynus>	 mathoid should not have a lot of impact on end users unless it is down for a long time, right?
[08:14:58] <akosiaris>	 volans: you mean nodes, but yes
[08:15:02] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal
[08:15:02] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal
[08:15:03] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy
[08:15:04] <volans>	 yeah, sorry
[08:15:10] <akosiaris>	 jynus: overall ? yes that's correct
[08:15:14] <volans>	 nice :)
[08:15:24] <jynus>	 it should only kick in on rendering, I guess?
[08:15:32] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:15:34] <jynus>	 but I don't know much about that level of the stack
[08:15:35] <akosiaris>	 yes and then the result is stored in restbase
[08:15:44] <jynus>	 so no big deal
[08:15:55] <akosiaris>	 yeah we probably lost a few renders
[08:15:59] <akosiaris>	 in fact I can calculate them
[08:16:00] <jynus>	 that is ok
[08:16:18] <jynus>	 I guess they also get a rerender on next display
[08:16:20] <akosiaris>	 and even then, mediawiki will retry IIRC
[08:16:25] <jynus>	 yeah
[08:17:07] <jynus>	 It would be nice to even get a render with "this part failed", but that may not be possible or easy (or it already happens)
[08:17:07] <akosiaris>	 so current rate is around 0.1 req/s
[08:17:48] <akosiaris>	 so ~24 initially failed render requests
[08:17:53] <akosiaris>	 https://grafana.wikimedia.org/dashboard/db/service-mathoid?panelId=8&fullscreen&orgId=1&from=now-15m&to=now
[08:18:26] <jynus>	 sorry I wasn't around, I was on a short break
[08:19:31] <marostegui>	 !log Optimize recentchanges table on s6 codfw - this will generate lag on codfw s6 - T178290
[08:19:32] <akosiaris>	 I'll do an incident response
[08:19:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:34] <stashbot>	 T178290: Check recentchanges table and query errors on wikis other than commonswiki and ruwiki - https://phabricator.wikimedia.org/T178290
[08:19:57] <jynus>	 marostegui: ok to go with my reimage plan?
[08:20:37] <jynus>	 I was going to do db2038, db2047 and db2052
[08:23:24] <marostegui>	 let me see
[08:23:39] <marostegui>	 jynus: yep, all good
[08:23:46] <wikibugs>	 (PS2) Filippo Giunchedi: graphite: add graphite2003 [puppet] - https://gerrit.wikimedia.org/r/443785 (https://phabricator.wikimedia.org/T196483)
[08:23:50] <jynus>	 ok, starting
[08:24:29] <wikibugs>	 (CR) Filippo Giunchedi: [C: 2] graphite: add graphite2003 [puppet] - https://gerrit.wikimedia.org/r/443785 (https://phabricator.wikimedia.org/T196483) (owner: Filippo Giunchedi)
[08:29:09] <icinga-wm>	 PROBLEM - puppet last run on graphite2003 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 8 seconds ago with 5 failures. Failed resources (up to 3 shown)
[08:29:38] <icinga-wm>	 PROBLEM - Check systemd state on graphite2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:30:26] <wikibugs>	 (PS2) Giuseppe Lavagetto: redis: remove cronjob for restarts on slaves as well [puppet] - https://gerrit.wikimedia.org/r/443772 (https://phabricator.wikimedia.org/T198220)
[08:30:28] <wikibugs>	 (PS2) Giuseppe Lavagetto: redis: remove now-useless specific classes [puppet] - https://gerrit.wikimedia.org/r/443773 (https://phabricator.wikimedia.org/T198220)
[08:30:30] <wikibugs>	 (PS1) Giuseppe Lavagetto: site.pp: move slave redises to system::spare [puppet] - https://gerrit.wikimedia.org/r/443789 (https://phabricator.wikimedia.org/T198220)
[08:31:38] <icinga-wm>	 RECOVERY - Check systemd state on graphite2003 is OK: OK - running: The system is fully operational
[08:31:55] <wikibugs>	 (PS3) Giuseppe Lavagetto: redis: remove cronjob for restarts on slaves as well [puppet] - https://gerrit.wikimedia.org/r/443772 (https://phabricator.wikimedia.org/T198220)
[08:34:38] <akosiaris>	 https://wikitech.wikimedia.org/wiki/Incident_documentation/20180704-mathoid
[08:35:18] <icinga-wm>	 PROBLEM - MD RAID on kubernetes1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:35:18] <icinga-wm>	 PROBLEM - Check size of conntrack table on kubernetes2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:35:18] <icinga-wm>	 PROBLEM - configured eth on kubernetes2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:36:49] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:36:49] <icinga-wm>	 PROBLEM - dhclient process on kubernetes2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:37:00] <akosiaris>	 I should manually trigger a rebalancing
[08:38:28] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:40:08] <icinga-wm>	 PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:40:08] <icinga-wm>	 PROBLEM - configured eth on kubernetes1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:40:08] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:40:08] <icinga-wm>	 PROBLEM - puppet last run on kubernetes2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:40:19] <icinga-wm>	 RECOVERY - MD RAID on kubernetes1002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[08:40:20] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] redis: remove cronjob for restarts on slaves as well [puppet] - https://gerrit.wikimedia.org/r/443772 (https://phabricator.wikimedia.org/T198220) (owner: Giuseppe Lavagetto)
[08:41:09] <icinga-wm>	 RECOVERY - Check size of conntrack table on kubernetes1002 is OK: OK: nf_conntrack is 0 % full
[08:41:09] <icinga-wm>	 RECOVERY - configured eth on kubernetes1002 is OK: OK - interfaces up
[08:41:44] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/443790
[08:41:48] <icinga-wm>	 PROBLEM - DPKG on kubernetes2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:43:04] <icinga-wm>	 RECOVERY - DPKG on kubernetes2001 is OK: All packages OK
[08:43:23] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2001 is OK: OK ferm input default policy is set
[08:43:34] <icinga-wm>	 RECOVERY - Check size of conntrack table on kubernetes2001 is OK: OK: nf_conntrack is 0 % full
[08:43:34] <icinga-wm>	 RECOVERY - configured eth on kubernetes2001 is OK: OK - interfaces up
[08:43:44] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/443790 (owner: Marostegui)
[08:45:01] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/443790 (owner: Marostegui)
[08:47:20] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1078 after alter table (duration: 00m 50s)
[08:47:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:05] <wikibugs>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/443790 (owner: Marostegui)
[08:49:34] <icinga-wm>	 RECOVERY - dhclient process on kubernetes2001 is OK: PROCS OK: 0 processes with command name dhclient
[08:49:49] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/443791 (https://phabricator.wikimedia.org/T191316)
[08:51:07] <wikibugs>	 (PS1) Giuseppe Lavagetto: mw-maintenance: switch to mwmaint1001 [puppet] - https://gerrit.wikimedia.org/r/443792 (https://phabricator.wikimedia.org/T192092)
[08:51:20] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/443791 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[08:51:30] <_joe_>	 akosiaris, apergos ^^ let's do it?
[08:51:41] <apergos>	 ahhh!
[08:51:46] <_joe_>	 and send an announcement about turning off terbium on monday?
[08:51:57] <akosiaris>	 _joe_: +1
[08:52:38] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/443791 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[08:52:53] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/443791 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[08:53:08] <elukey>	 !log update analytics-in4 filter rules on cr1/cr2 eqiad - T198623
[08:53:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:12] <stashbot>	 T198623: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623
[08:53:50] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1123 for alter table (duration: 00m 50s)
[08:53:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:03] <marostegui>	 !log Deploy schema change on db1123 T191316 T192926 T89737 T195193 
[08:54:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:09] <stashbot>	 T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[08:54:09] <stashbot>	 T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[08:54:10] <stashbot>	 T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[08:54:10] <stashbot>	 T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[08:54:50] <apergos>	 how do we move those cron jobs he was talking about?
[08:56:21] <moritzm>	 !log uploaded linux 4.9.107~wmf1 for jessie-wikimedia to apt.wikimedia.org
[08:56:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:11] <wikibugs>	 (CR) Gehel: [C: 2] wdqs: create log files during log rotation [puppet] - https://gerrit.wikimedia.org/r/443583 (owner: Gehel)
[08:57:25] <wikibugs>	 (PS3) Gehel: wdqs: create log files during log rotation [puppet] - https://gerrit.wikimedia.org/r/443583
[08:58:18] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet operation_type={podsandbox_status,remove_container,stop_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:58:38] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet operation_type={podsandbox_status,remove_container,stop_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[09:00:02] <akosiaris>	 !log manually rebalance the mathoid kubernetes production cluster namespaces pods wise
[09:00:03] <apergos>	 I see... he left some patchsets already prepped for us
[09:00:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:18] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[09:00:22] <apergos>	 https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441381/  and then https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441346/
[09:00:47] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[09:00:55] <akosiaris>	 yeah, _joe_ kind of merged them from what I gather
[09:00:58] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type=container_status https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[09:01:13] <_joe_>	 yes
[09:01:58] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[09:02:24] <wikibugs>	 (PS3) Giuseppe Lavagetto: redis: remove now-useless specific classes [puppet] - https://gerrit.wikimedia.org/r/443773 (https://phabricator.wikimedia.org/T198220)
[09:02:38] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2001 is OK: OK - running: The system is fully operational
[09:03:00] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] redis: remove now-useless specific classes [puppet] - https://gerrit.wikimedia.org/r/443773 (https://phabricator.wikimedia.org/T198220) (owner: Giuseppe Lavagetto)
[09:03:47] <marostegui>	 !log Optimize recentchanges table on s2 codfw - this will generate lag on codfw s2 - T178290
[09:03:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:50] <stashbot>	 T178290: Check recentchanges table and query errors on wikis other than commonswiki and ruwiki - https://phabricator.wikimedia.org/T178290
[09:04:33] <wikibugs>	 (CR) Jcrespo: [C: 2] mariadb: Depool db2038 and db2047 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/443766 (owner: Jcrespo)
[09:04:38] <wikibugs>	 (PS1) Muehlenhoff: Bump meta package for new ABI in 4.9.107 [debs/linux-meta] - https://gerrit.wikimedia.org/r/443794
[09:05:07] <apergos>	 +1 from me
[09:05:48] <wikibugs>	 (Merged) jenkins-bot: mariadb: Depool db2038 and db2047 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/443766 (owner: Jcrespo)
[09:07:38] <apergos>	 _joe_: ^^
[09:07:56] <wikibugs>	 (CR) jenkins-bot: mariadb: Depool db2038 and db2047 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/443766 (owner: Jcrespo)
[09:08:37] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes2001 is OK: OK: synced at Wed 2018-07-04 09:08:29 UTC.
[09:08:47] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Bump meta package for new ABI in 4.9.107 [debs/linux-meta] - https://gerrit.wikimedia.org/r/443794 (owner: Muehlenhoff)
[09:09:28] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2038, db2047 (duration: 00m 50s)
[09:09:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:18] <icinga-wm>	 RECOVERY - puppet last run on kubernetes2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[09:12:22] <akosiaris>	 volans: I am done with the reimages
[09:12:53] <volans>	 akosiaris: thanks, I'm testing the code anyway with parallel files to not hurt anyone's work ;)
[09:13:21] <jynus>	 volans: so I guess you are doing deploy now?
[09:13:53] <volans>	 jynus: no, still testing, and need to fine tune a couple of things for edge cases, feel free to proceed with your reimages
[09:14:07] <wikibugs>	 (CR) Addshore: [C: 1] "Per https://www.mediawiki.org/wiki/Manual:$wgRightsPage" [mediawiki-config] - https://gerrit.wikimedia.org/r/443531 (owner: Krinkle)
[09:14:22] <volans>	 I will not merge before checking with you all ;)
[09:15:25] <wikibugs>	 (PS2) Gehel: Use backports version of osm2pgsql on Stretch for improved memory handling [puppet] - https://gerrit.wikimedia.org/r/443668 (https://phabricator.wikimedia.org/T198485) (owner: Mholloway)
[09:15:36] <elukey>	 !log reimage aqs1009 to Debian Stretch
[09:15:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:52] <jynus>	 volans: will do 2 reimages then with whatever version we have now
[09:16:04] <volans>	 ack
[09:16:04] <moritzm>	 !log uploaded linux-meta 1.18 for jessie-wikimedia to apt.wikimedia.org
[09:16:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:36] <wikibugs>	 (CR) Gehel: [C: 2] Use backports version of osm2pgsql on Stretch for improved memory handling [puppet] - https://gerrit.wikimedia.org/r/443668 (https://phabricator.wikimedia.org/T198485) (owner: Mholloway)
[09:16:52] <icinga-wm>	 ACKNOWLEDGEMENT - Long running screen/tmux on lawrencium is CRITICAL: NRPE: Command check_check_long_procs not defined Jcrespo Host to be decomm. https://phabricator.wikimedia.org/T191360
[09:18:31] <volans>	 hashar: you around by any chance?
[09:21:13] <wikibugs>	 (CR) Addshore: [C: 1] "+1 to the move and removing things from these Wikibase* files." (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/443532 (owner: Krinkle)
[09:23:51] <wikibugs>	 (CR) Addshore: [C: 1] Remove wmgUseClusterFileBackend (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443533 (owner: Krinkle)
[09:24:15] <hashar>	 volans: yes I am :)
[09:24:38] <volans>	 hashar: how can I see what a couple of CI jobs execute?
[09:24:54] <volans>	 I'm interested in operations-dns-lint and operations-dns-tabs
[09:25:16] <volans>	 apparently I don't have permissions to see the configuration of the jobs ;)
[09:25:17] <wikibugs>	 (CR) Addshore: [C: 1] Remove wmgUseClusterFileBackend (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443534 (owner: Krinkle)
[09:27:03] <hashar>	 volans: hmm yeah maybe ops don't have access
[09:27:16] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: hieradata: profile::openstack::eqiad1::neutron::metadata_proxy_shared_secret [labs/private] - https://gerrit.wikimedia.org/r/443796 (https://phabricator.wikimedia.org/T196633)
[09:27:38] <wikibugs>	 (CR) Addshore: [C: 1] Move filebackend.php include towards the top near other includes [mediawiki-config] - https://gerrit.wikimedia.org/r/443535 (owner: Krinkle)
[09:27:42] <hashar>	 volans: for operations-dns-lint that is https://github.com/wikimedia/integration-config/blob/master/jjb/operations-misc.yaml#L22-L26
[09:27:47] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: 2 C: 2] hieradata: profile::openstack::eqiad1::neutron::metadata_proxy_shared_secret [labs/private] - https://gerrit.wikimedia.org/r/443796 (https://phabricator.wikimedia.org/T196633) (owner: Arturo Borrero Gonzalez)
[09:28:00] <hashar>	 which uses a script 'authdns-lint' shipped via puppet
[09:28:04] <volans>	 ok
[09:28:25] <hashar>	 the -tabs job, it is a nasty find|xargs|grep something
[09:28:41] <wikibugs>	 (PS1) Filippo Giunchedi: graphite: update graphite-auth for django 1.9 [puppet] - https://gerrit.wikimedia.org/r/443797
[09:28:43] <wikibugs>	 (PS1) Filippo Giunchedi: graphite: fix for stretch/django 1.9 compat [puppet] - https://gerrit.wikimedia.org/r/443798 (https://phabricator.wikimedia.org/T196483)
[09:28:46] <wikibugs>	 (PS1) Alexandros Kosiaris: labsdb1006: Reimage as stretch and make it osm::master [puppet] - https://gerrit.wikimedia.org/r/443799 (https://phabricator.wikimedia.org/T197246)
[09:28:54] <hashar>	 volans: https://github.com/wikimedia/integration-config/blob/17bb03adb5276b70ad87aadbfe6f31143cbd50e2/jjb/job-templates.yaml#L234-L247
[09:28:58] <volans>	 ok, so we don't have any easy 'entry point' for adding stuff there
[09:29:04] <akosiaris>	 !log upload kubernetes 1.8.14 to apt.wikimedia.org/stretch-wikimedia/main
[09:29:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:29] <volans>	 thanks for the pointers to the code :)
[09:29:40] <wikibugs>	 (CR) jerkins-bot: [V: -1] graphite: fix for stretch/django 1.9 compat [puppet] - https://gerrit.wikimedia.org/r/443798 (https://phabricator.wikimedia.org/T196483) (owner: Filippo Giunchedi)
[09:31:13] <wikibugs>	 (CR) Addshore: [C: 1] Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443529 (owner: Krinkle)
[09:32:20] <wikibugs>	 (PS2) Giuseppe Lavagetto: mw-maintenance: switch to mwmaint1001 [puppet] - https://gerrit.wikimedia.org/r/443792 (https://phabricator.wikimedia.org/T192092)
[09:32:22] <wikibugs>	 (PS1) Giuseppe Lavagetto: terbium: Add a decommission notice. [puppet] - https://gerrit.wikimedia.org/r/443801 (https://phabricator.wikimedia.org/T192092)
[09:33:04] <_joe_>	 ok apergos, akosiaris I'm going on with the first change, can you take a look at the decom notice?
[09:34:52] <wikibugs>	 (CR) Addshore: [C: 1] Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (1/2) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/443529 (owner: Krinkle)
[09:35:18] <akosiaris>	 it's wrong fwiw
[09:35:21] <wikibugs>	 (03PS2) 10Filippo Giunchedi: graphite: fix for stretch/django 1.9 compat [puppet] - 10https://gerrit.wikimedia.org/r/443798 (https://phabricator.wikimedia.org/T196483)
[09:35:31] <wikibugs>	 (03CR) 10Addshore: [C: 031] Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443530 (owner: 10Krinkle)
[09:35:34] <akosiaris>	 it's applied on the wrong host. The decom motd::script I mean
[09:35:59] <apergos>	 it will be?
[09:36:04] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: add nova common profile [puppet] - 10https://gerrit.wikimedia.org/r/443802 (https://phabricator.wikimedia.org/T196633)
[09:37:29] <akosiaris>	 well it says This server will be decommissioned on July 9th; please use
[09:37:30] <akosiaris>	 mwmaint1001.eqiad.wmnet instead.
[09:37:44] <akosiaris>	 so I am assuming this means it is to be applied to terbium
[09:37:51] <akosiaris>	 I'll upload a change
[09:39:51] <jynus>	 yes, mwmaint1001 is the final one, but AFAIK, terbium is not yet 100% unused
[09:39:53] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: add nova common profile [puppet] - 10https://gerrit.wikimedia.org/r/443802 (https://phabricator.wikimedia.org/T196633)
[09:39:57] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: mw-maintenance: switch to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/443792 (https://phabricator.wikimedia.org/T192092)
[09:40:12] <_joe_>	 akosiaris: yeah sigh
[09:40:18] <_joe_>	 I added it in the wrong place :D
[09:40:19] <akosiaris>	 ok fixed
[09:40:37] <apergos>	 i'll send mail to ops, engineering
[09:40:45] <_joe_>	 apergos: oh thanks
[09:40:50] <_joe_>	 maybe wikitech-l too?
[09:40:58] <_joe_>	 wmde people have access as well
[09:41:07] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: mw-maintenance: switch to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/443792 (https://phabricator.wikimedia.org/T192092) (owner: 10Giuseppe Lavagetto)
[09:41:09] <_joe_>	 either we just send an email to everyone who has access
[09:41:09] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: terbium: Add a decommission notice. [puppet] - 10https://gerrit.wikimedia.org/r/443801 (https://phabricator.wikimedia.org/T192092) (owner: 10Giuseppe Lavagetto)
[09:41:21] <_joe_>	 argh you've overwritten my ps3
[09:41:24] <wikibugs>	 (03CR) 10Ladsgroup: [C: 031] "I thought I did this. Thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443686 (https://phabricator.wikimedia.org/T198768) (owner: 10Sbisson)
[09:42:05] <hasharAway>	 volans: I am off, but potentially we could craft a job that clones operations/puppet for the authdnslint script, and uses an entry point at the root of operations/dns.git (eg using make or a shell script or whatever)
[09:42:18] <hasharAway>	 volans: I am off though, but we can talk about it tomorrow morning
[09:42:39] <volans>	 hasharAway: sure, no hurry
[09:43:18] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: mw-maintenance: switch to mwmaint1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/443792 (https://phabricator.wikimedia.org/T192092)
[09:43:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] terbium: Add a decommission notice. [puppet] - 10https://gerrit.wikimedia.org/r/443801 (https://phabricator.wikimedia.org/T192092) (owner: 10Giuseppe Lavagetto)
[09:43:48] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] mw-maintenance: switch to mwmaint1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/443792 (https://phabricator.wikimedia.org/T192092) (owner: 10Giuseppe Lavagetto)
[09:44:31] <_joe_>	 !log stopping all cronjobs via a puppet run on terbium, T192092
[09:44:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:34] <stashbot>	 T192092: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092
[09:46:57] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: openstack: eqiad1: add nova common profile [puppet] - 10https://gerrit.wikimedia.org/r/443802 (https://phabricator.wikimedia.org/T196633)
[09:47:15] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compiler is good:" [puppet] - 10https://gerrit.wikimedia.org/r/443802 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez)
[09:47:35] <wikibugs>	 (03PS1) 10Elukey: Revert "netboot.cfg: temp remove aqs hosts to allow manual work during d-i" [puppet] - 10https://gerrit.wikimedia.org/r/443805
[09:47:52] <wikibugs>	 (03Abandoned) 10Elukey: Revert "netboot.cfg: temp remove aqs hosts to allow manual work during d-i" [puppet] - 10https://gerrit.wikimedia.org/r/443805 (owner: 10Elukey)
[09:48:03] <wikibugs>	 (03PS1) 10Elukey: Revert "netboot.cfg: temp remove aqs hosts to allow manual work during d-i" [puppet] - 10https://gerrit.wikimedia.org/r/443806
[09:48:10] <wikibugs>	 (03PS2) 10Elukey: Revert "netboot.cfg: temp remove aqs hosts to allow manual work during d-i" [puppet] - 10https://gerrit.wikimedia.org/r/443806
[09:48:49] <wikibugs>	 (03CR) 10Elukey: [C: 032] Revert "netboot.cfg: temp remove aqs hosts to allow manual work during d-i" [puppet] - 10https://gerrit.wikimedia.org/r/443806 (owner: 10Elukey)
[09:51:39] <apergos>	 I'd rather spam the three lists than try targeted email. draft is ready to go when the switch is complete and things look stable
[09:52:17] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: terbium: Add a decommission notice. [puppet] - 10https://gerrit.wikimedia.org/r/443801 (https://phabricator.wikimedia.org/T192092)
[09:52:21] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Allow reimage to stretch of db2047, db2038 [puppet] - 10https://gerrit.wikimedia.org/r/443807
[09:53:01] <wikibugs>	 (03PS2) 10Filippo Giunchedi: graphite: update graphite-auth for django 1.9 [puppet] - 10https://gerrit.wikimedia.org/r/443797
[09:53:08] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] Add IPv6 records for authdns2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/443580 (https://phabricator.wikimedia.org/T196664) (owner: 10Vgutierrez)
[09:53:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] graphite: update graphite-auth for django 1.9 [puppet] - 10https://gerrit.wikimedia.org/r/443797 (owner: 10Filippo Giunchedi)
[09:53:17] <wikibugs>	 (03PS2) 10Vgutierrez: Add IPv6 records for authdns2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/443580 (https://phabricator.wikimedia.org/T196664)
[09:53:20] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Allow reimage to stretch of db2047, db2038 [puppet] - 10https://gerrit.wikimedia.org/r/443807
[09:54:13] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Allow reimage to stretch of db2047, db2038 [puppet] - 10https://gerrit.wikimedia.org/r/443807
[09:54:49] <_joe_>	 meh, if we only didn't use ff-only...
[09:54:55] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] terbium: Add a decommission notice. [puppet] - 10https://gerrit.wikimedia.org/r/443801 (https://phabricator.wikimedia.org/T192092) (owner: 10Giuseppe Lavagetto)
[09:54:56] <apergos>	 :-)
[09:55:09] <_joe_>	 I hate I lose ~ 30 mins/week to this
[09:55:19] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: terbium: Add a decommission notice. [puppet] - 10https://gerrit.wikimedia.org/r/443801 (https://phabricator.wikimedia.org/T192092)
[09:55:23] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] terbium: Add a decommission notice. [puppet] - 10https://gerrit.wikimedia.org/r/443801 (https://phabricator.wikimedia.org/T192092) (owner: 10Giuseppe Lavagetto)
[09:59:51] <icinga-wm>	 RECOVERY - puppet last run on graphite2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[10:01:20] <moritzm>	 !log rolling reboot of sca* for "lazy fpu" kernel updates
[10:01:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:19] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: site.pp: change priority of the motd, fix script [puppet] - 10https://gerrit.wikimedia.org/r/443808
[10:07:45] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] site.pp: change priority of the motd, fix script [puppet] - 10https://gerrit.wikimedia.org/r/443808 (owner: 10Giuseppe Lavagetto)
[10:09:31] <icinga-wm>	 PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[10:11:33] <_joe_>	 uh? ^^
[10:12:27] <_joe_>	 that's codfw, expected?
[10:13:44] <ema>	 _joe_: it seems there's been a spike in traffic and then a drop to normal levels
[10:13:54] <_joe_>	 yeah I was looking as well
[10:14:05] <_joe_>	 interesting we only alert on the drop :D
[10:14:05] <apergos>	 same
[10:14:33] <apergos>	 I guess we won't alert on a spike unless it's big enough to put us out of business
[10:14:59] <wikibugs>	 (03PS3) 10Filippo Giunchedi: graphite: fix for stretch/django 1.9 compat [puppet] - 10https://gerrit.wikimedia.org/r/443798 (https://phabricator.wikimedia.org/T196483)
[10:15:08] <ema>	 yeah the idea behind that alert is that we want to know when a DC gets a suspiciously low amount of traffic (because eg. router drama) 
[10:15:11] <_joe_>	 the drop in codfw is kinda strange though
[10:15:19] <_joe_>	 ema: yeah I get the reasoning
[10:15:21] <wikibugs>	 10Operations, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547 (10jcrespo) Sorry, you didn't understood what I meant- for ORES, it was: T159753 and for translation, T183485, both as sum...
[10:16:16] <ema>	 the drop is text-only it seems
[10:16:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] graphite: fix for stretch/django 1.9 compat [puppet] - 10https://gerrit.wikimedia.org/r/443798 (https://phabricator.wikimedia.org/T196483) (owner: 10Filippo Giunchedi)
[10:16:58] <ema>	 however, in line with last week: https://grafana.wikimedia.org/dashboard/db/varnish-caching-last-week-comparison?refresh=15m&orgId=1&var-cluster=text&var-site=codfw&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5
[10:17:25] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443809
[10:21:29] <wikibugs>	 (03PS1) 10Filippo Giunchedi: graphite: fix require_package [puppet] - 10https://gerrit.wikimedia.org/r/443810 (https://phabricator.wikimedia.org/T196483)
[10:22:11] <icinga-wm>	 PROBLEM - puppet last run on graphite2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:22:28] <godog>	 known ^
[10:22:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] graphite: fix require_package [puppet] - 10https://gerrit.wikimedia.org/r/443810 (https://phabricator.wikimedia.org/T196483) (owner: 10Filippo Giunchedi)
[10:22:43] <wikibugs>	 (03PS2) 10Filippo Giunchedi: graphite: fix require_package [puppet] - 10https://gerrit.wikimedia.org/r/443810 (https://phabricator.wikimedia.org/T196483)
[10:23:40] <wikibugs>	 10Operations, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547 (10awight) @jcrespo  I see, well in this case content storage is exactly what we're planning to use.  Is there anything sp...
[10:23:42] <icinga-wm>	 PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:24:11] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443809 (owner: 10Marostegui)
[10:25:28] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443809 (owner: 10Marostegui)
[10:26:13] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: add nova conductor profile [puppet] - 10https://gerrit.wikimedia.org/r/443811 (https://phabricator.wikimedia.org/T196633)
[10:26:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack: eqiad1: add nova conductor profile [puppet] - 10https://gerrit.wikimedia.org/r/443811 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez)
[10:27:03] <wikibugs>	 10Operations, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547 (10jcrespo) Ok, that is much better, but I guess it still would double the revision table (or the 5 new tables that are to...
[10:27:12] <icinga-wm>	 RECOVERY - puppet last run on graphite2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:27:46] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443809 (owner: 10Marostegui)
[10:28:05] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1123 after alter table (duration: 00m 50s)
[10:28:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:12] <icinga-wm>	 PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:29:31] <icinga-wm>	 PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:33:13] <apergos>	 _joe_:  do we want to manually edit the www-data crontab on terbium so no new jobs will start?
[10:33:38] <_joe_>	 apergos: those left are just silver leftovers
[10:33:43] <_joe_>	 they don't do anything
[10:33:50] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Allow reimage to stretch of db2047, db2038 [puppet] - 10https://gerrit.wikimedia.org/r/443807 (owner: 10Jcrespo)
[10:33:52] <_joe_>	 but yeah, be my guest :)
[10:33:53] <moritzm>	 godog: modules/profile/manifests/labs/monitoring.pp declared mod-uwsgi with a plain package, so puppet trips over the duplicate declaration
[10:33:58] <wikibugs>	 (03PS4) 10Jcrespo: mariadb: Allow reimage to stretch of db2047, db2038 [puppet] - 10https://gerrit.wikimedia.org/r/443807
[10:34:26] <godog>	 moritzm: sigh, thanks
[10:35:30] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443815 (https://phabricator.wikimedia.org/T191316)
[10:35:41] <icinga-wm>	 RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[10:37:00] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443815 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui)
[10:38:15] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443815 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui)
[10:38:30] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443815 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui)
[10:38:36] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: add nova conductor profile [puppet] - 10https://gerrit.wikimedia.org/r/443811 (https://phabricator.wikimedia.org/T196633)
[10:39:01] <icinga-wm>	 RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[10:39:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack: eqiad1: add nova conductor profile [puppet] - 10https://gerrit.wikimedia.org/r/443811 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez)
[10:39:59] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1077 for alter table (duration: 00m 50s)
[10:40:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:21] <marostegui>	 !log Stop replication on db1077 to drop triggers on db1124:3313 - T192926
[10:42:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:25] <stashbot>	 T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[10:43:16] <wikibugs>	 (03PS1) 10Filippo Giunchedi: labs: move libapache2-mod-uwsgi to graphite::web [puppet] - 10https://gerrit.wikimedia.org/r/443816 (https://phabricator.wikimedia.org/T196483)
[10:44:22] <godog>	 moritzm arturo ^
[10:44:28] <moritzm>	 looking
[10:44:42] <arturo>	 godog: ?
[10:45:08] <arturo>	 (I ignore wikibugs, if that's what you are pointing to)
[10:45:36] <godog>	 arturo: ah yeah, I was pointing to https://gerrit.wikimedia.org/r/c/operations/puppet/+/443816
[10:46:09] <godog>	 FWIW I have traffic bot show up as NOTICEs not as PRIVMSGs as it should be
[10:46:28] <godog>	 helps with telling the two apart
[10:46:34] <marostegui>	 !log Deploy schema change on db1077 with replication, this will generate lag on labs s3 T191316 T192926 T89737 T195193 
[10:46:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:40] <stashbot>	 T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[10:46:41] <stashbot>	 T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[10:46:41] <stashbot>	 T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[10:48:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] "Looks good (ideally the remaining package declarations were also switched to require_package, which would have prevented that error also)," [puppet] - 10https://gerrit.wikimedia.org/r/443816 (https://phabricator.wikimedia.org/T196483) (owner: 10Filippo Giunchedi)
[10:48:32] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 031] "I would say run this with puppet catalog compiler to make sure it wont break due to the package being required elsewhere (ordering, whatev" [puppet] - 10https://gerrit.wikimedia.org/r/443816 (https://phabricator.wikimedia.org/T196483) (owner: 10Filippo Giunchedi)
[10:49:58] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: openstack: eqiad1: add nova conductor profile [puppet] - 10https://gerrit.wikimedia.org/r/443811 (https://phabricator.wikimedia.org/T196633)
[10:53:54] <jynus>	 !log stop db2038 and db2047
[10:53:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:56] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: openstack: eqiad1: add nova conductor profile [puppet] - 10https://gerrit.wikimedia.org/r/443811 (https://phabricator.wikimedia.org/T196633)
[10:58:45] <godog>	 !log update compiler facts
[10:58:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:31] <arturo>	 godog: labmon1001.eqiad.wmnet instead of .wikimedia.org ?
[11:00:17] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compiler is good to go:" [puppet] - 10https://gerrit.wikimedia.org/r/443811 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez)
[11:00:55] <godog>	 arturo: yeah that was it, then I realized I need new facts for graphite2003 anyway
[11:01:06] <arturo>	 great
[11:01:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] "PCC https://puppet-compiler.wmflabs.org/compiler02/11663/" [puppet] - 10https://gerrit.wikimedia.org/r/443816 (https://phabricator.wikimedia.org/T196483) (owner: 10Filippo Giunchedi)
[11:01:33] <arturo>	 thanks BTW godog :-)
[11:01:39] <wikibugs>	 (03PS2) 10Filippo Giunchedi: labs: move libapache2-mod-uwsgi to graphite::web [puppet] - 10https://gerrit.wikimedia.org/r/443816 (https://phabricator.wikimedia.org/T196483)
[11:01:48] <godog>	 hehe no problem arturo ! simple enough fix
[11:05:53] <godog>	 mhh puppet-facts-export fails with KeyError: 'trusted'
[11:05:59] <godog>	 will take a look after lunch
[11:07:53] <icinga-wm>	 PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:08:03] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on cp3048 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T198784
[11:08:42] <icinga-wm>	 PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:09:28] <arturo>	 oh ^^^^ these puppet errors seem to be my fault
[11:09:37] <mark>	 !log cp3043: mdadm /dev/md0 --fail /dev/sdb1
[11:09:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:21] <icinga-wm>	 RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:13:33] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on cp3043 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T198785
[11:14:13] <icinga-wm>	 PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:15:17] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues cpettet Ok
[11:17:18] <mark>	 !log cp3043: mdadm /dev/md0 --add /dev/sdc1 (sdc is former cp3048:sdb)
[11:17:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:22] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues cpettet Ok
[11:18:37] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: pass version hiera key down to nova conductor service class [puppet] - 10https://gerrit.wikimedia.org/r/443817
[11:22:23] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compiler seems OK:" [puppet] - 10https://gerrit.wikimedia.org/r/443817 (owner: 10Arturo Borrero Gonzalez)
[11:27:33] <marostegui>	 jouncebot: next
[11:27:33] <jouncebot>	 In 25 hour(s) and 32 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180705T1300)
[11:27:53] <moritzm>	 !log resuming rolling restart of cassandra on restbase hosts in eqiad completed
[11:27:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:58] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443818
[11:28:02] <moritzm>	 !log rolling restart of cassandra on restbase hosts in eqiad completed
[11:28:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:07] <icinga-wm>	 RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:29:15] <icinga-wm>	 RECOVERY - puppet last run on labtestcontrol2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:29:27] <icinga-wm>	 RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:29:27] <icinga-wm>	 RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:29:28] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: add nova scheduler profile [puppet] - 10https://gerrit.wikimedia.org/r/443819 (https://phabricator.wikimedia.org/T196633)
[11:35:12] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: eqiad1: add nova scheduler profile [puppet] - 10https://gerrit.wikimedia.org/r/443819 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez)
[11:35:23] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compiler is OK: https://puppet-compiler.wmflabs.org/compiler02/11668/" [puppet] - 10https://gerrit.wikimedia.org/r/443819 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez)
[11:38:45] <icinga-wm>	 PROBLEM - Host cp3034.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[11:39:35] <icinga-wm>	 PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 132 not-conn: cp3034_v4, cp3034_v6, cp3048_v4, cp3048_v6
[11:40:06] <icinga-wm>	 PROBLEM - Host cp3048.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[11:40:06] <icinga-wm>	 ACKNOWLEDGEMENT - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 132 not-conn: cp3034_v4, cp3034_v6, cp3048_v4, cp3048_v6 Ema hw maintenance
[11:40:40] <icinga-wm>	 ACKNOWLEDGEMENT - Host cp3048.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ema hw maintenance
[11:40:40] <icinga-wm>	 ACKNOWLEDGEMENT - Host cp3034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ema hw maintenance
[11:44:05] <icinga-wm>	 RECOVERY - Host cp3034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.32 ms
[11:52:28] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443818 (owner: 10Marostegui)
[11:53:44] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443818 (owner: 10Marostegui)
[11:54:54] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1077 after alter table (duration: 00m 52s)
[11:54:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:25] <icinga-wm>	 RECOVERY - Host cp3048.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.33 ms
[11:58:02] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443818 (owner: 10Marostegui)
[12:04:08] <moritzm>	 !log installing ruby 1.9 security updates on trusty
[12:04:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:11] <wikibugs>	 (03CR) 10Aklapper: "Tim Starling: See T198570" [puppet] - 10https://gerrit.wikimedia.org/r/441525 (https://phabricator.wikimedia.org/T197922) (owner: 1020after4)
[12:19:39] <wikibugs>	 10Operations, 10ops-esams: Relabel hooft to bast3002 - https://phabricator.wikimedia.org/T198790 (10mark)
[12:25:55] <icinga-wm>	 PROBLEM - Host snapshot1005 is DOWN: PING CRITICAL - Packet loss = 100%
[12:26:01] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: enable nova-api [puppet] - 10https://gerrit.wikimedia.org/r/443824 (https://phabricator.wikimedia.org/T196633)
[12:27:27] <jynus>	 I don't see any log about restarting snapshot1005
[12:27:53] <jynus>	 apergos or arturo ^ something you may be aware of?
[12:28:07] <apergos>	 no
[12:28:19] <jynus>	 so either crash or network loss
[12:28:25] <jynus>	 checking
[12:28:50] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: enable nova-api [puppet] - 10https://gerrit.wikimedia.org/r/443824 (https://phabricator.wikimedia.org/T196633)
[12:29:02] <arturo>	 no idea
[12:31:01] <jynus>	 "The server is not powered on"
[12:31:12] <apergos>	 Server Power: Off
[12:31:16] <apergos>	 that's nice
[12:31:40] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compiler is good for:" [puppet] - 10https://gerrit.wikimedia.org/r/443824 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez)
[12:31:59] <apergos>	 jynus: I am on the mgmt interface
[12:32:08] <apergos>	 going to power it back up, ok?
[12:32:09] <volans>	 !log shutting down bast3002 for disk replacement
[12:32:10] <jynus>	 I will let you handle
[12:32:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:13] <apergos>	 cool
[12:32:14] <jynus>	 ok for me
[12:32:18] <volans>	 (use a different bastion for a few minutes)
[12:33:11] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443825 (https://phabricator.wikimedia.org/T197069)
[12:36:17] <apergos>	 nice
[12:36:25] <apergos>	 doesn't want to power back on
[12:36:37] <jynus>	 I would check the lifecycle log
[12:36:47] <jynus>	 maybe a power failure was detected
[12:37:08] <jynus>	 or something else replaceable
[12:40:37] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on bast3002 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T198791
[12:41:36] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443825 (https://phabricator.wikimedia.org/T197069) (owner: 10Marostegui)
[12:42:29] <wikibugs>	 10Operations, 10ops-esams: Degraded RAID on bast3002 - https://phabricator.wikimedia.org/T198791 (10Volans)
[12:42:31] <wikibugs>	 10Operations, 10ops-esams: Degraded RAID on bast3002 - https://phabricator.wikimedia.org/T183814 (10Volans)
[12:42:53] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443825 (https://phabricator.wikimedia.org/T197069) (owner: 10Marostegui)
[12:43:08] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443825 (https://phabricator.wikimedia.org/T197069) (owner: 10Marostegui)
[12:46:23] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1089 for maintenance - T197069 (duration: 02m 57s)
[12:46:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:26] <stashbot>	 T197069: Failover db1052 (s1) db primary master - https://phabricator.wikimedia.org/T197069
[12:51:56] <wikibugs>	 (03PS1) 10Marostegui: db1089.yaml: Change binlog format [puppet] - 10https://gerrit.wikimedia.org/r/443826 (https://phabricator.wikimedia.org/T197069)
[12:53:13] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db1089.yaml: Change binlog format [puppet] - 10https://gerrit.wikimedia.org/r/443826 (https://phabricator.wikimedia.org/T197069) (owner: 10Marostegui)
[12:56:18] <marostegui>	 !log Stop MySQL and reboot db1089 to upgrade+change it to statement - T197069
[12:56:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:22] <stashbot>	 T197069: Failover db1052 (s1) db primary master - https://phabricator.wikimedia.org/T197069
[13:03:27] <wikibugs>	 (03PS1) 10Ema: cache_misc: decommission cp3009 [puppet] - 10https://gerrit.wikimedia.org/r/443827 (https://phabricator.wikimedia.org/T148422)
[13:11:09] <wikibugs>	 (03PS1) 10ArielGlenn: switch en wiki dumps to run on snapshot1009 for now [puppet] - 10https://gerrit.wikimedia.org/r/443828 (https://phabricator.wikimedia.org/T198792)
[13:11:43] <wikibugs>	 (03CR) 10ArielGlenn: [C: 04-1] "do not merge, this is just ready in case we need it" [puppet] - 10https://gerrit.wikimedia.org/r/443828 (https://phabricator.wikimedia.org/T198792) (owner: 10ArielGlenn)
[13:25:08] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443829
[13:26:01] <wikibugs>	 (03PS1) 10Jcrespo: Revert "mariadb: Depool db2038 and db2047 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443830
[13:28:07] <wikibugs>	 (03CR) 10Muehlenhoff: "a" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443361 (https://phabricator.wikimedia.org/T198526) (owner: 10Volans)
[13:28:23] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443829 (owner: 10Marostegui)
[13:28:32] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2038 and db2047 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443830 (owner: 10Jcrespo)
[13:28:41] <volans>	 moritzm: I think I need some additional context :-P ^^^^ ('a')
[13:28:49] <marostegui>	 jynus: you go first :)
[13:28:49] <wikibugs>	 10Operations, 10ops-esams: Move cp3030+ from OE14 to OE13 in racktables - https://phabricator.wikimedia.org/T136403 (10mark) 05Open>03Resolved This has now been corrected and verified on-site.
[13:29:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] Refactor client authentication (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443361 (https://phabricator.wikimedia.org/T198526) (owner: 10Volans)
[13:29:45] <moritzm>	 volans: UI fail, now properly added as a comment :-)
[13:29:46] <volans>	 lol, thx
[13:30:01] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443829 (owner: 10Marostegui)
[13:30:04] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2038 and db2047 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443830 (owner: 10Jcrespo)
[13:30:17] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443829 (owner: 10Marostegui)
[13:33:22] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet operation_type={create_container,podsandbox_status,remove_container,run_podsandbox,start_container,stop_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[13:34:00] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2038, db2047 (duration: 02m 56s)
[13:34:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:17] <jynus>	 I know which ssh: connect to host snapshot1005.eqiad.wmnet
[13:34:23] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[13:34:44] <jynus>	 as you will be deploying next, marostegui, can you coordinate with apergos to depool it?
[13:34:48] <jynus>	 from scap
[13:35:05] <apergos>	 ah I have that in my patchset....
[13:35:15] <marostegui>	 I will wait for you to depool it then :)
[13:35:16] <apergos>	 when is the next deploy scheduled?
[13:35:17] <apergos>	 now?
[13:35:18] <jynus>	 so heads up, ^ marostegui
[13:35:19] <marostegui>	 before my deploy
[13:35:42] <apergos>	 oh yeah, wmf-config
[13:35:43] <apergos>	 sigh
[13:35:45] <marostegui>	 yeah :)
[13:35:52] <apergos>	 all right I'll just push it through now nbd
[13:36:23] <jynus>	 the others went good and no errors
[13:36:37] <jynus>	 plus codfw should not create many issues, worst case scenario
[13:36:44] <volans>	 I guess it's not conftool-driven snapshot1005
[13:37:13] <marostegui>	 apergos: let me know when done and I will deploy and confirm if I see errors again
[13:37:13] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on bast3002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=bast3002&var-datasource=esams%2520prometheus%252Fops
[13:37:42] <wikibugs>	 (03CR) 10ArielGlenn: [C: 032] switch en wiki dumps to run on snapshot1009 for now [puppet] - 10https://gerrit.wikimedia.org/r/443828 (https://phabricator.wikimedia.org/T198792) (owner: 10ArielGlenn)
[13:37:54] <jynus>	 see you
[13:38:14] <marostegui>	 jynus: o/
[13:40:07] <wikibugs>	 (03PS14) 10Gehel: Prep work for multi-instance elasticsearch refactor [puppet] - 10https://gerrit.wikimedia.org/r/440498 (owner: 10EBernhardson)
[13:40:09] <wikibugs>	 (03PS17) 10Gehel: convert role::logstash::elasticsearch to profiles [puppet] - 10https://gerrit.wikimedia.org/r/441894 (owner: 10EBernhardson)
[13:40:11] <wikibugs>	 (03PS21) 10Gehel: prometheus/elasticsearch support multiple exporters per host [puppet] - 10https://gerrit.wikimedia.org/r/441321 (owner: 10EBernhardson)
[13:40:13] <wikibugs>	 (03PS24) 10Gehel: Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 (owner: 10EBernhardson)
[13:40:15] <wikibugs>	 (03PS52) 10Gehel: Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson)
[13:40:30] <apergos>	 marostegui: that should do it, live on deploy1001
[13:40:32] <godog>	 cmjohnson1: hey, I was checking ms-be1036, host is fine but I can't reach its ipmi via ssh
[13:40:43] <marostegui>	 apergos: cool, let me see
[13:40:47] <apergos>	 I think chris is out the rest of this week, go dog
[13:41:08] <godog>	 ah! thanks apergos 
[13:41:18] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus/elasticsearch support multiple exporters per host [puppet] - 10https://gerrit.wikimedia.org/r/441321 (owner: 10EBernhardson)
[13:41:37] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1089 for after maintenance (duration: 00m 50s)
[13:41:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:48] <marostegui>	 apergos: worked fine, no timeouts on that host!
[13:41:49] <marostegui>	 thanks
[13:42:00] <apergos>	 volans: no these are not conftool configured, we don't want to depool unless someone has done so deliberately
[13:42:11] <apergos>	 it's a tiny cluster, we need almost every box
[13:42:31] <apergos>	 maroste gui: great!
[13:44:38] <volans>	 ack, thanks for the info ;)
[13:47:53] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443834
[13:47:57] <Cyberpower678>	 Hey guys, a blackout on itwiki/eswiki and the other wikis is fine and all, but is it really appropriate to be redirecting the API to that blackout page?  I can safely say bots were never designed to handle redirects from the API.
[13:48:03] <icinga-wm>	 PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[13:48:25] <Cyberpower678>	 IABot is throwing a tantrum right now.
[13:49:13] <icinga-wm>	 RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[13:51:18] <Reedy>	 Cyberpower678: I'd probably file a task for that
[13:51:22] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443834 (owner: 10Marostegui)
[13:51:32] <Cyberpower678>	 Reedy: which project?
[13:52:23] <Reedy>	 Honestly not sure
[13:52:38] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443834 (owner: 10Marostegui)
[13:52:38] * Cyberpower678 makes a high priority task.
[13:52:43] <wikibugs>	 (03CR) 10Gehel: "puppet compiler looks reasonable: https://puppet-compiler.wmflabs.org/compiler02/11671/" [puppet] - 10https://gerrit.wikimedia.org/r/440498 (owner: 10EBernhardson)
[13:54:09] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1089 for after maintenance (duration: 00m 50s)
[13:54:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:33] <Cyberpower678>	 Reedy: nevermind my bad.  The tantrum is a different problem.  The api.php root page is only redirecting.  All the actions are still functional.
[13:56:54] <Reedy>	 Ah, ok :)
[13:57:05] * Cyberpower678 just checks to make sure.
[13:58:23] * Cyberpower678 accesses the run logs of IABot
[14:08:46] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/429225 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi)
[14:10:03] <wikibugs>	 (03PS5) 10Filippo Giunchedi: Deprecate Diamond tcpconnstate and nfconntrackcount [puppet] - 10https://gerrit.wikimedia.org/r/429225 (https://phabricator.wikimedia.org/T183454)
[14:10:28] <moritzm>	 !log installing file/libmagic security updates on trusty (Debian already fixed)
[14:10:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:45] <wikibugs>	 10Operations, 10Cassandra, 10Services (watching), 10User-Eevans: Cassandra uses default ip address for outbound packets while bootstrapping - https://phabricator.wikimedia.org/T128590 (10fgiunchedi) >>! In T128590#4396369, @Eevans wrote: > @fgiunchedi, is this still a thing?  Good question, I think we'll h...
[14:12:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] Deprecate Diamond tcpconnstate and nfconntrackcount [puppet] - 10https://gerrit.wikimedia.org/r/429225 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi)
[14:17:20] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443835
[14:18:58] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443835 (owner: 10Marostegui)
[14:20:11] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443835 (owner: 10Marostegui)
[14:22:35] <wikibugs>	 (03PS4) 10Muehlenhoff: Allow removing Diamond gradually [puppet] - 10https://gerrit.wikimedia.org/r/429389 (https://phabricator.wikimedia.org/T183454)
[14:22:57] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1089 for after maintenance (duration: 00m 50s)
[14:22:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:07] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: kubernetes: Switch to all in firewalling policy [puppet] - 10https://gerrit.wikimedia.org/r/443836
[14:24:59] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: kubernetes::staging: Switch to all in firewall policy [puppet] - 10https://gerrit.wikimedia.org/r/443836
[14:25:12] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes::staging: Switch to all in firewall policy [puppet] - 10https://gerrit.wikimedia.org/r/443836 (owner: 10Alexandros Kosiaris)
[14:25:44] <akosiaris>	 !log upgrade kubernetes staging API server to 1.8.14
[14:25:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:31] <wikibugs>	 (03PS1) 10Andrew Bogott: labsaliaser:  suppress stdout in the cronjob [puppet] - 10https://gerrit.wikimedia.org/r/443837
[14:29:07] <wikibugs>	 (03PS2) 10Andrew Bogott: labsaliaser:  suppress stdout in the cronjob [puppet] - 10https://gerrit.wikimedia.org/r/443837
[14:30:12] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] labsaliaser:  suppress stdout in the cronjob [puppet] - 10https://gerrit.wikimedia.org/r/443837 (owner: 10Andrew Bogott)
[14:31:02] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Adjust kubelet latencies thresholds [puppet] - 10https://gerrit.wikimedia.org/r/443838
[14:33:20] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] Adjust kubelet latencies thresholds [puppet] - 10https://gerrit.wikimedia.org/r/443838 (owner: 10Alexandros Kosiaris)
[14:36:44] <moritzm>	 !log installing perl security updates on trusty (Debian already fixed)
[14:36:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:21] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubestage1001 is CRITICAL: instance=kubestage1001.eqiad.wmnet operation_type={podsandbox_status,remove_container,stop_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[14:39:30] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubestage1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[14:52:19] <moritzm>	 !log installing libipc-run-perl updates from jessie point release
[14:52:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] "I've changed the graphite udp dashboards to use Prometheus for UDP instead of Graphite" [puppet] - 10https://gerrit.wikimedia.org/r/442865 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi)
[14:55:42] <wikibugs>	 (03PS2) 10Filippo Giunchedi: statsite: deprecate Diamond udp collector [puppet] - 10https://gerrit.wikimedia.org/r/442865 (https://phabricator.wikimedia.org/T183454)
[15:00:24] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[15:02:24] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[15:12:20] <wikibugs>	 10Operations, 10ops-esams, 10Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063 (10mark)
[15:19:51] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] Allow to delete hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443362 (https://phabricator.wikimedia.org/T198526) (owner: 10Volans)
[15:20:28] <wikibugs>	 10Operations, 10Cassandra, 10Services (watching), 10User-Eevans: Cassandra uses default ip address for outbound packets while bootstrapping - https://phabricator.wikimedia.org/T128590 (10mobrovac) 05Open>03stalled
[15:27:46] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] Allow to link hosts to external resources [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443363 (https://phabricator.wikimedia.org/T198590) (owner: 10Volans)
[15:28:41] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM, but no DataTables expert here." [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443364 (https://phabricator.wikimedia.org/T198591) (owner: 10Volans)
[15:29:35] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki: add vhost define [puppet] - 10https://gerrit.wikimedia.org/r/439893 (https://phabricator.wikimedia.org/T196968)
[15:29:37] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki::web: move mediawiki.org, test.wikidata to individual files [puppet] - 10https://gerrit.wikimedia.org/r/443842 (https://phabricator.wikimedia.org/T196968)
[15:29:39] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki::web: convert mediawiki.org to mediawiki::vhost [puppet] - 10https://gerrit.wikimedia.org/r/443843 (https://phabricator.wikimedia.org/T196968)
[15:29:41] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] DataTables: cleanup initialization [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443365 (https://phabricator.wikimedia.org/T198591) (owner: 10Volans)
[15:31:16] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] DataTables: refactor column grouping [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443367 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans)
[15:33:51] <wikibugs>	 (03PS1) 10Ema: Decommission ex-cache_maps: cp300[3-6] [puppet] - 10https://gerrit.wikimedia.org/r/443845 (https://phabricator.wikimedia.org/T167376)
[15:34:40] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki::web: move mediawiki.org, test.wikidata to individual files [puppet] - 10https://gerrit.wikimedia.org/r/443842 (https://phabricator.wikimedia.org/T196968)
[15:37:46] <wikibugs>	 (03PS1) 10Ema: Remove DNS entries for esams ex-cache_maps: cp300[3-6] [dns] - 10https://gerrit.wikimedia.org/r/443846 (https://phabricator.wikimedia.org/T167376)
[15:38:18] <wikibugs>	 (03PS2) 10Ema: Decommission esams ex-cache_maps: cp300[3-6] [puppet] - 10https://gerrit.wikimedia.org/r/443845 (https://phabricator.wikimedia.org/T167376)
[15:44:04] <wikibugs>	 (03CR) 10Ema: [C: 032] Decommission esams ex-cache_maps: cp300[3-6] [puppet] - 10https://gerrit.wikimedia.org/r/443845 (https://phabricator.wikimedia.org/T167376) (owner: 10Ema)
[15:45:20] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443847
[15:46:35] <wikibugs>	 (03PS1) 10Mobrovac: c-for-each: Increase retry and delay defaults [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/443848 (https://phabricator.wikimedia.org/T198787)
[15:47:18] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443847 (owner: 10Marostegui)
[15:47:21] <wikibugs>	 (03PS2) 10Mobrovac: c-foreach-restart: Increase retry and delay defaults [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/443848 (https://phabricator.wikimedia.org/T198787)
[15:48:35] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443847 (owner: 10Marostegui)
[15:50:15] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1089 for after maintenance (duration: 00m 51s)
[15:50:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:40] <marostegui>	 !log Optimize dewiki.logging on s5 codfw master with replication, this will generate lag on s5 codfw - T197459
[15:55:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:44] <stashbot>	 T197459: Optimize logging table - https://phabricator.wikimedia.org/T197459
[15:57:02] <wikibugs>	 (03PS2) 10Ema: Remove production DNS entries for esams ex-cache_maps [dns] - 10https://gerrit.wikimedia.org/r/443846 (https://phabricator.wikimedia.org/T167376)
[15:58:07] <wikibugs>	 (03PS3) 10Ema: Remove production DNS entries for esams ex-cache_maps [dns] - 10https://gerrit.wikimedia.org/r/443846 (https://phabricator.wikimedia.org/T167376)
[15:58:54] <wikibugs>	 (03CR) 10Ema: [C: 032] Remove production DNS entries for esams ex-cache_maps [dns] - 10https://gerrit.wikimedia.org/r/443846 (https://phabricator.wikimedia.org/T167376) (owner: 10Ema)
[16:01:49] <wikibugs>	 (03PS1) 10Ema: Remove mgmt DNS entries for esams ex-cache_maps [dns] - 10https://gerrit.wikimedia.org/r/443851 (https://phabricator.wikimedia.org/T167376)
[16:05:39] <icinga-wm>	 PROBLEM - Host cp3008 is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:15] <ema>	 marostegui: cp3008 down? ^
[16:06:24] <ema>	 sorry, I mean mark
[16:07:08] <icinga-wm>	 RECOVERY - Host cp3008 is UP: PING OK - Packet loss = 0%, RTA = 84.08 ms
[16:23:36] <mark>	 no alert for lvs300x management? :)
[16:24:16] <volans>	 mark: only lvs3001.mgmt
[16:25:02] <volans>	 actually just recovered
[16:31:33] <mark>	 sure
[16:31:37] <mark>	 i rerouted those cables
[16:44:12] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): "wgRightsUrl is actually used as a default for the "dataRightsUrl" setting WikibaseRepo uses for the little JS popup messages that appear w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443531 (owner: 10Krinkle)
[16:45:36] <wikibugs>	 (03PS4) 10Filippo Giunchedi: WIP grafana: host overview dashboard as code [puppet] - 10https://gerrit.wikimedia.org/r/442301 (https://phabricator.wikimedia.org/T171482)
[16:46:07] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP grafana: host overview dashboard as code [puppet] - 10https://gerrit.wikimedia.org/r/442301 (https://phabricator.wikimedia.org/T171482) (owner: 10Filippo Giunchedi)
[16:56:56] <wikibugs>	 (03PS7) 10Volans: wmf-auto-reimage: validate certificate fingerprint [puppet] - 10https://gerrit.wikimedia.org/r/433928
[16:56:58] <wikibugs>	 (03PS5) 10Volans: wmf-auto-reimage: improve donwtime of reimaged host [puppet] - 10https://gerrit.wikimedia.org/r/434894 (https://phabricator.wikimedia.org/T195423)
[16:57:00] <wikibugs>	 (03PS3) 10Volans: wmf-auto-reimage: use absolute path for subprocess [puppet] - 10https://gerrit.wikimedia.org/r/434896
[16:57:02] <wikibugs>	 (03PS2) 10Volans: wmf-auto-reimage: fix parse argument bug [puppet] - 10https://gerrit.wikimedia.org/r/443670
[16:57:04] <wikibugs>	 (03PS2) 10Volans: wmf-auto-reimage: use warning log level [puppet] - 10https://gerrit.wikimedia.org/r/443671
[16:58:54] <wikibugs>	 (03CR) 10Volans: "@vgutierrez: since last review I've fixed two bugs (installer=True and --color=false) and improved the logging" [puppet] - 10https://gerrit.wikimedia.org/r/433928 (owner: 10Volans)
[17:01:12] <wikibugs>	 (03CR) 10Volans: "The latest changes are:" [puppet] - 10https://gerrit.wikimedia.org/r/434894 (https://phabricator.wikimedia.org/T195423) (owner: 10Volans)
[17:08:06] <volans>	 sorry for the spam ;)
[17:17:22] <wikibugs>	 (03PS2) 10Volans: Refactor client authentication [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443361 (https://phabricator.wikimedia.org/T198526)
[17:17:24] <wikibugs>	 (03PS2) 10Volans: Allow to delete hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443362 (https://phabricator.wikimedia.org/T198526)
[17:17:26] <wikibugs>	 (03PS2) 10Volans: Allow to link hosts to external resources [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443363 (https://phabricator.wikimedia.org/T198590)
[17:17:28] <wikibugs>	 (03PS2) 10Volans: DataTables: save state for the session [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443364 (https://phabricator.wikimedia.org/T198591)
[17:17:30] <wikibugs>	 (03PS2) 10Volans: DataTables: cleanup initialization [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443365 (https://phabricator.wikimedia.org/T198591)
[17:17:32] <wikibugs>	 (03PS2) 10Volans: Kernels refactor [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443366 (https://phabricator.wikimedia.org/T198592)
[17:17:34] <wikibugs>	 (03PS2) 10Volans: DataTables: refactor column grouping [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443367 (https://phabricator.wikimedia.org/T198592)
[17:17:36] <wikibugs>	 (03PS2) 10Volans: Add search capability [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443368 (https://phabricator.wikimedia.org/T198592)
[17:17:48] <wikibugs>	 (03CR) 10Volans: Refactor client authentication (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443361 (https://phabricator.wikimedia.org/T198526) (owner: 10Volans)
[17:18:11] <wikibugs>	 (03CR) 10Volans: DataTables: refactor column grouping (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443367 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans)
[17:18:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Refactor client authentication [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443361 (https://phabricator.wikimedia.org/T198526) (owner: 10Volans)
[17:18:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Allow to delete hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443362 (https://phabricator.wikimedia.org/T198526) (owner: 10Volans)
[17:18:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Allow to link hosts to external resources [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443363 (https://phabricator.wikimedia.org/T198590) (owner: 10Volans)
[17:18:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] DataTables: save state for the session [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443364 (https://phabricator.wikimedia.org/T198591) (owner: 10Volans)
[17:19:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] DataTables: cleanup initialization [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443365 (https://phabricator.wikimedia.org/T198591) (owner: 10Volans)
[17:19:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add search capability [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443368 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans)
[17:19:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Kernels refactor [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443366 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans)
[17:19:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] DataTables: refactor column grouping [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443367 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans)
[17:31:24] <AaronSchulz>	 _joe_, Krinkle: so, I'd like to swat https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/440469/ tomorrow FYI
[17:59:59] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:04:19] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:04:39] <icinga-wm>	 PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:05:48] <icinga-wm>	 RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:36:18] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:40:48] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:47:28] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:59:29] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[19:04:59] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[19:07:09] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[19:31:19] <icinga-wm>	 RECOVERY - MD RAID on bast3002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[20:15:01] <wikibugs>	 (03CR) 10jenkins-bot: Revert "mariadb: Depool db2038 and db2047 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443830 (owner: 10Jcrespo)
[20:15:03] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443834 (owner: 10Marostegui)
[20:15:05] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443835 (owner: 10Marostegui)
[20:15:07] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443847 (owner: 10Marostegui)
[20:48:49] <wikibugs>	 (03PS3) 10Krinkle: Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443529
[20:48:59] <wikibugs>	 (03PS3) 10Krinkle: Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443530
[21:11:11] <wikibugs>	 (03CR) 10Krinkle: "Indeed, on the cluster it is set directly in Wikibase.php:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443531 (owner: 10Krinkle)
[21:11:17] <wikibugs>	 (03PS2) 10Krinkle: Remove wgRightsUrl from Wikibase config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443531
[21:15:47] <wikibugs>	 (03CR) 10Krinkle: Move wgRightsPage/wgRightsText from Wikibase.php to InitialiseSettings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443532 (owner: 10Krinkle)
[21:15:49] <wikibugs>	 (03PS2) 10Krinkle: Move wgRightsPage/wgRightsText from Wikibase.php to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443532
[21:22:28] <wikibugs>	 (03CR) 10Krinkle: [C: 032] Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443529 (owner: 10Krinkle)
[21:22:40] * Krinkle staging on deploy1001 and mwdebug1002
[21:23:48] <wikibugs>	 (03Merged) 10jenkins-bot: Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443529 (owner: 10Krinkle)
[21:24:52] <wikibugs>	 (03CR) 10Reedy: "Those defaults probably want bumping again :D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443529 (owner: 10Krinkle)
[21:27:02] <Krinkle>	 Reedy: Aye, bumping them can be scary though in terms of impact. I quite like them to be far in the past :P
[21:27:14] <Reedy>	 I used to do 6 months at a time
[21:27:25] <Krinkle>	 Just to confirm though, do you foresee an issue with moving them?
[21:27:32] <Reedy>	 Move them to yesterday, sure
[21:27:40] <Reedy>	 Move them to a year ago... Not so much
[21:27:42] <Krinkle>	 I mean moving the code.
[21:27:48] <Reedy>	 oh :
[21:27:49] <Reedy>	 :P
[21:27:58] <wikibugs>	 (03CR) 10jenkins-bot: Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443529 (owner: 10Krinkle)
[21:28:03] <Reedy>	 Should be fine
[21:30:20] <wikibugs>	 (03CR) 10Krinkle: [C: 032] Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443530 (owner: 10Krinkle)
[21:30:30] <Reedy>	 I guess it was in CommonSettings because no overrides at one point
[21:30:53] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Ie30aeecbe (duration: 00m 52s)
[21:30:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:35] <wikibugs>	 (03Merged) 10jenkins-bot: Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443530 (owner: 10Krinkle)
[21:32:29] <wikibugs>	 (03CR) 10jenkins-bot: Move wgCacheEpoch override from Wikibase.php to InitialiseSettings (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443530 (owner: 10Krinkle)
[21:33:22] <wikibugs>	 (03CR) 10Krinkle: [C: 032] Remove wgRightsUrl from Wikibase config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443531 (owner: 10Krinkle)
[21:34:07] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/: I1e78ba4365 (duration: 00m 52s)
[21:34:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:40] <wikibugs>	 (03Merged) 10jenkins-bot: Remove wgRightsUrl from Wikibase config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443531 (owner: 10Krinkle)
[21:35:49] <wikibugs>	 (03PS3) 10Krinkle: Move wgRightsPage/wgRightsText from Wikibase.php to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443532
[21:36:22] <wikibugs>	 (03CR) 10Krinkle: [C: 032] Move wgRightsPage/wgRightsText from Wikibase.php to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443532 (owner: 10Krinkle)
[21:36:28] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/: If7da2a26bbf (duration: 00m 51s)
[21:36:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:56] <wikibugs>	 (CR) jenkins-bot: Remove wgRightsUrl from Wikibase config [mediawiki-config] - https://gerrit.wikimedia.org/r/443531 (owner: Krinkle)
[21:37:38] <wikibugs>	 (Merged) jenkins-bot: Move wgRightsPage/wgRightsText from Wikibase.php to InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/443532 (owner: Krinkle)
[21:41:47] <wikibugs>	 (CR) jenkins-bot: Move wgRightsPage/wgRightsText from Wikibase.php to InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/443532 (owner: Krinkle)
[21:42:36] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Ia8deabc5d2625 (duration: 00m 50s)
[21:42:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:38] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/: Ia8deabc5d2625 (duration: 00m 51s)
[21:43:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:44:45] <wikibugs>	 (PS3) Krinkle: Remove wmgUseClusterFileBackend (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443533
[21:44:53] <wikibugs>	 (CR) Krinkle: [C: 2] Remove wmgUseClusterFileBackend (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443533 (owner: Krinkle)
[21:46:32] <wikibugs>	 (Merged) jenkins-bot: Remove wmgUseClusterFileBackend (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443533 (owner: Krinkle)
[21:47:03] <wikibugs>	 (PS3) Krinkle: Remove wmgUseClusterFileBackend (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443534
[21:49:16] <wikibugs>	 (CR) Krinkle: [C: 2] Remove wmgUseClusterFileBackend (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443534 (owner: Krinkle)
[21:49:55] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/: I6f8cfa8f (duration: 00m 51s)
[21:49:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:28] <wikibugs>	 (Merged) jenkins-bot: Remove wmgUseClusterFileBackend (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443534 (owner: Krinkle)
[21:50:52] <wikibugs>	 (CR) jenkins-bot: Remove wmgUseClusterFileBackend (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/443533 (owner: Krinkle)
[21:52:52] <wikibugs>	 (PS2) Krinkle: Move filebackend.php include towards the top near other includes [mediawiki-config] - https://gerrit.wikimedia.org/r/443535
[21:53:35] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Iab176f3d205 (duration: 00m 50s)
[21:53:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:22] <wikibugs>	 (CR) Krinkle: [C: 2] Move filebackend.php include towards the top near other includes [mediawiki-config] - https://gerrit.wikimedia.org/r/443535 (owner: Krinkle)
[21:55:37] <wikibugs>	 (Merged) jenkins-bot: Move filebackend.php include towards the top near other includes [mediawiki-config] - https://gerrit.wikimedia.org/r/443535 (owner: Krinkle)
[22:06:00] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: I5a6e28814ba6b7 (duration: 00m 51s)
[22:06:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:34] <wikibugs>	 (PS2) Krinkle: Remove duplicate phpunit entry from composer.json [mediawiki-config] - https://gerrit.wikimedia.org/r/443108 (owner: Reedy)
[22:17:37] <wikibugs>	 (CR) Krinkle: [C: 2] Remove duplicate phpunit entry from composer.json [mediawiki-config] - https://gerrit.wikimedia.org/r/443108 (owner: Reedy)
[22:18:58] <wikibugs>	 (Merged) jenkins-bot: Remove duplicate phpunit entry from composer.json [mediawiki-config] - https://gerrit.wikimedia.org/r/443108 (owner: Reedy)
[22:23:58] <wikibugs>	 (PS3) Krinkle: Move if onto newline in FeaturedFeedsWMF.php [mediawiki-config] - https://gerrit.wikimedia.org/r/439438 (owner: Reedy)
[22:24:05] <Krinkle>	 Reedy: Mind if I land that as-is?
[22:24:31] <Reedy>	 Yeah
[22:24:45] <Reedy>	 That file is a messss
[22:25:25] <wikibugs>	 (CR) Krinkle: [C: -1] "Could you explain the problem and solution in more detail? It's not exactly clear to me in what way this was a duplicate." [mediawiki-config] - https://gerrit.wikimedia.org/r/440092 (https://phabricator.wikimedia.org/T197095) (owner: Urbanecm)
[22:25:35] <wikibugs>	 (CR) Krinkle: [C: 2] Move if onto newline in FeaturedFeedsWMF.php [mediawiki-config] - https://gerrit.wikimedia.org/r/439438 (owner: Reedy)
[22:25:54] <Krinkle>	 Well, it's happy hour and you've got two patches for the price of 0.
[22:26:02] <Krinkle>	 Get 'em while you can :P
[22:26:52] <wikibugs>	 (Merged) jenkins-bot: Move if onto newline in FeaturedFeedsWMF.php [mediawiki-config] - https://gerrit.wikimedia.org/r/439438 (owner: Reedy)
[22:28:44] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/FeaturedFeedsWMF.php: I004bc9c3e71 (duration: 00m 50s)
[22:28:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:08] <wikibugs>	 (PS1) Reedy: Bump default cache epochs from 20130601 to 20160101 [mediawiki-config] - https://gerrit.wikimedia.org/r/443866
[22:30:21] <Reedy>	 :P
[22:31:32] <Reedy>	 Or do you want 2015? ;)
[22:31:37] <Krinkle>	 Reedy: So the reason I think it's potentially high-impact isn't the purging of old cache values (that'd be good, and only a minority are that old, so we should be fine there), but the purging of current values that use it as a hashing source.
[22:31:41] <Krinkle>	 At least RL uses it that way
[22:31:55] <Krinkle>	 which means bumping it or changing it in any way is the same as clearing all caches of all wikis.
[22:32:58] <Krinkle>	 It's a bit of a nuclear approach. I'd actually support removing its use from several code paths one by one.
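Editor's note: Krinkle's point above is that an epoch value mixed into a hash input invalidates every derived key when it changes, regardless of how old the cached entry is. A minimal sketch of that effect, assuming a made-up key scheme; `cache_key`, `keys_for`, the module name "startup", and the version string "v1" are illustrative only and are not MediaWiki or ResourceLoader code:

```python
import hashlib


def cache_key(wiki: str, module: str, content_version: str, cache_epoch: str) -> str:
    """Build a cache key whose hash input includes the epoch (hypothetical scheme)."""
    raw = "|".join([wiki, module, content_version, cache_epoch])
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:12]


def keys_for(cache_epoch: str) -> dict:
    """Compute the key for the same unchanged content on a few wikis."""
    wikis = ["eswiki", "enwiki", "wikidatawiki"]
    return {wiki: cache_key(wiki, "startup", "v1", cache_epoch) for wiki in wikis}


# The content version stays "v1"; only the epoch moves, as in the proposed bump.
old_keys = keys_for("20130601")
new_keys = keys_for("20160101")

# Every wiki whose key changed now misses the cache, even though nothing else changed.
changed = [wiki for wiki in old_keys if old_keys[wiki] != new_keys[wiki]]
print(changed)
```

Under this assumption, every key differs after the bump, which is why the change behaves like a cache clear for all wikis rather than an expiry of old entries only.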
[22:43:49] <Krinkle>	 Hm.. it's quite possible ThumbnailEpoch doesn't affect us anymore given that we use the 404 handler.
[22:43:52] <Krinkle>	 And also, Thumbor.
[22:44:22] <Krinkle>	 It's only used if RENDER_NOW is passed, which happens for UploadStash and ThumbnailJob, which are for recently uploaded files only.
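Editor's note: the condition Krinkle describes can be sketched as a guard that consults the thumbnail epoch only on the immediate-render path. This is an illustration of the statement above, not MediaWiki's implementation; `needs_rerender`, the flag value, and the timestamp representation are assumptions:

```python
from typing import Optional

RENDER_NOW = 1  # hypothetical flag bit, standing in for the real constant


def needs_rerender(thumb_mtime: Optional[int], thumbnail_epoch: int, flags: int) -> bool:
    """Decide whether a thumbnail must be regenerated right away.

    thumb_mtime is the existing thumbnail's timestamp, or None if no file exists.
    """
    if not flags & RENDER_NOW:
        # Deferred path: the thumbnail is produced on demand (for example by a
        # 404 handler), so the epoch is never consulted here.
        return False
    # Immediate path: render if the thumbnail is missing or predates the epoch.
    return thumb_mtime is None or thumb_mtime < thumbnail_epoch


# Deferred request: the epoch has no effect, however old the thumbnail is.
print(needs_rerender(20120101, 20160101, flags=0))
# Immediate request for an old thumbnail: the epoch forces a re-render.
print(needs_rerender(20120101, 20160101, flags=RENDER_NOW))
```

If only upload-time code paths pass the flag, as stated above, the epoch comparison would affect recently uploaded files only.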
[22:49:47] <wikibugs>	 (CR) Krinkle: [C: -1] "Would recommend adding this to the multiversion/ directory instead so that PSR-4 and namespaces within wmf-config can be evaluated and dec" [mediawiki-config] - https://gerrit.wikimedia.org/r/331825 (owner: Dereckson)
[22:51:58] <wikibugs>	 (CR) Urbanecm: "It was suggested by @MarcoAurelio in 440002. He says as all bureaucrats are sysops, it makes no sense to assign sysop-level privileges to " [mediawiki-config] - https://gerrit.wikimedia.org/r/440092 (https://phabricator.wikimedia.org/T197095) (owner: Urbanecm)
[22:52:28] <wikibugs>	 (CR) Urbanecm: "(to explain myself: sysops are allowed to grant IPBE everywhere)" [mediawiki-config] - https://gerrit.wikimedia.org/r/440092 (https://phabricator.wikimedia.org/T197095) (owner: Urbanecm)
[23:23:46] <wikibugs>	 (PS1) Krinkle: Improve file-level documentation for various wmf-config files [mediawiki-config] - https://gerrit.wikimedia.org/r/443870
[23:27:29] <wikibugs>	 (PS2) Krinkle: Improve file-level documentation for various wmf-config files [mediawiki-config] - https://gerrit.wikimedia.org/r/443870