[00:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Evening SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T0000).
[00:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[00:01:10] <wikibugs>	 (03PS1) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602
[00:01:45] <mutante>	 mobrovac: if we can say that this defined type is only for mediawiki/services.. then this way i would say? ^
[00:02:10] <mutante>	 it almost feels like a line similar to that existed at some point and got lost
[00:05:28] <wikibugs>	 (03PS2) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602
[00:08:00] <wikibugs>	 (03PS3) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366)
[00:10:09] <wikibugs>	 (03PS7) 10MarcoAurelio: mediawiki: Stop logging each run of purge_abusefilter.pp [puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591)
[00:10:41] <icinga-wm>	 PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:11:53] <wikibugs>	 (03CR) 10Mobrovac: services: add missing 'mediawiki/services' prefix to git cloning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn)
[00:12:05] <wikibugs>	 (03CR) 10MarcoAurelio: "Thanks for checking the logs. If there's nothing wrong (ie: "ERROR" or similar; I don't remember the exact output) and ops are content wit" [puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) (owner: 10MarcoAurelio)
[00:12:47] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect
[00:12:47] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect
[00:12:47] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect
[00:12:59] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect
[00:13:13] <wikibugs>	 (03Abandoned) 10MarcoAurelio: Disable NewUserMessage on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482679 (owner: 10MarcoAurelio)
[00:14:13] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[00:14:13] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[00:14:17] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[00:14:21] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[00:14:23] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[00:14:26] <wikibugs>	 (03PS8) 10Dzahn: mediawiki: Stop logging each run of purge_abusefilter.pp [puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) (owner: 10MarcoAurelio)
[00:14:27] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[00:14:27] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[00:14:29] <icinga-wm>	 PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[00:14:31] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[00:14:45] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s8 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[00:14:47] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[00:14:47] <icinga-wm>	 PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[00:14:49] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[00:14:49] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[00:14:57] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[00:15:03] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[00:15:07] <paladox>	 hm
[00:15:09] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[00:15:09] <icinga-wm>	 PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[00:15:13] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[00:15:13] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect
[00:15:15] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[00:15:15] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect
[00:15:20] <SQL>	 aaaaa hi
[00:15:24] <paladox>	 lol
[00:15:31] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect
[00:15:42] <logmsgbot>	 !log mobrovac@deploy1001 Finished deploy [restbase/deploy@a04ebdd]: Restart RESTBase to pick up the fact that restbase1016 is not there - T212418 (duration: 21m 34s)
[00:15:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:46] <stashbot>	 T212418: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418
[00:15:46] <SQL>	 heh, TBH I have mentions here muted, just thought that'd be funny
[00:16:45] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 3.07 seconds
[00:16:59] <wikibugs>	 (03CR) 10Mobrovac: [C: 03+1] visualdiff: ensure git clone happens before creating pngs dir [puppet] - 10https://gerrit.wikimedia.org/r/484342 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn)
[00:18:10] <wikibugs>	 (03PS2) 10Dzahn: visualdiff: ensure git clone happens before creating pngs dir [puppet] - 10https://gerrit.wikimedia.org/r/484342 (https://phabricator.wikimedia.org/T201366)
[00:19:14] <wikibugs>	 (03PS2) 10Dbarratt: Enable Partial Blocks on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476884 (https://phabricator.wikimedia.org/T210444) (owner: 10Dmaza)
[00:19:45] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2089 is OK: OK slave_sql_lag Replication lag: 20.30 seconds
[00:19:55] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 9.11 seconds
[00:20:01] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2094 is OK: OK slave_sql_lag Replication lag: 0.02 seconds
[00:20:11] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2066 is OK: OK slave_sql_lag Replication lag: 0.21 seconds
[00:20:21] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2075 is OK: OK slave_sql_lag Replication lag: 0.41 seconds
[00:20:31] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[00:20:35] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2059 is OK: OK slave_sql_lag Replication lag: 0.49 seconds
[00:20:37] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2084 is OK: OK slave_sql_lag Replication lag: 0.33 seconds
[00:20:59] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] visualdiff: ensure git clone happens before creating pngs dir [puppet] - 10https://gerrit.wikimedia.org/r/484342 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn)
[00:21:47] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect
[00:22:18] <wikibugs>	 10Operations, 10RESTBase-Cassandra, 10Services (next): restbase cassandra driver excessive logging when cassandra hosts are down - https://phabricator.wikimedia.org/T212424 (10Pchelolo) A bit of background regarding the current appearance:  - When restbase1016 failed, it started logging with an extremelly ha...
[00:23:33] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect
[00:23:47] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect
[00:23:47] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect
[00:23:55] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "worked on scandium:" [puppet] - 10https://gerrit.wikimedia.org/r/484342 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn)
[00:24:49] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 0.51 seconds
[00:25:15] <mutante>	 re: dbstore1002 (also) https://phabricator.wikimedia.org/T206965
[00:27:13] <icinga-wm>	 PROBLEM - HP RAID on ms-be1034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.223: Connection reset by peer
[00:34:43] <wikibugs>	 (03PS4) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366)
[00:34:56] <wikibugs>	 (03CR) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn)
[00:35:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn)
[00:36:53] <icinga-wm>	 RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[00:37:39] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.83 seconds
[00:38:53] <wikibugs>	 (03PS5) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366)
[00:40:28] <wikibugs>	 (03PS6) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366)
[00:42:55] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.78 seconds
[00:42:55] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.30 seconds
[00:42:55] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.34 seconds
[00:43:11] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.50 seconds
[00:43:11] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.49 seconds
[00:43:23] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.21 seconds
[00:43:47] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.52 seconds
[00:45:23] <wikibugs>	 (03PS2) 10Thcipriani: Add Blubber directory to releases server [puppet] - 10https://gerrit.wikimedia.org/r/483800 (https://phabricator.wikimedia.org/T213563)
[00:46:04] <wikibugs>	 (03CR) 10Thcipriani: "> One minor omission. The profile needs to be included in the" [puppet] - 10https://gerrit.wikimedia.org/r/483800 (https://phabricator.wikimedia.org/T213563) (owner: 10Thcipriani)
[00:50:39] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave
[00:50:39] <icinga-wm>	 RECOVERY - MariaDB Slave IO: m3 on dbstore1002 is OK: OK slave_io_state not a slave
[00:50:39] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag not a slave
[00:50:51] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2363.94 seconds
[00:50:51] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: m2 on dbstore1002 is OK: OK slave_sql_lag not a slave
[00:50:59] <icinga-wm>	 RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave
[00:51:03] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3314.28 seconds
[00:51:03] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3322.28 seconds
[00:51:03] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3281.29 seconds
[00:51:17] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state not a slave
[00:51:17] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2751.14 seconds
[00:51:17] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3331.14 seconds
[00:51:17] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3146.15 seconds
[00:56:45] <icinga-wm>	 ACKNOWLEDGEMENT - High lag on wdqs2005 is CRITICAL: 8639 ge 3600 Stas Malychev db reload, will catch up soon https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[00:57:27] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.68 seconds
[00:57:41] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.99 seconds
[00:57:46] <wikibugs>	 (03PS9) 10Ottomata: [WIP] Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247)
[00:57:51] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.12 seconds
[00:58:13] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.13 seconds
[00:58:17] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.48 seconds
[00:58:25] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.03 seconds
[00:58:29] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.72 seconds
[00:58:39] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.51 seconds
[00:58:45] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.08 seconds
[00:59:04] <wikibugs>	 10Operations, 10RESTBase-Cassandra, 10Services (next): restbase cassandra driver excessive logging when cassandra hosts are down - https://phabricator.wikimedia.org/T212424 (10mobrovac) >>! In T212424#4883122, @Pchelolo wrote: > As I understand, the driver does not recognize a node being marked as DOWN by Ca...
[01:02:15] <wikibugs>	 (03CR) 10Mobrovac: services: add missing 'mediawiki/services' prefix to git cloning (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn)
[01:02:43] <SMalyshev>	 !log repooled wdqs200[45] for now, 2006 still not done, will get to it later today
[01:02:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:54] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/14350/releases1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/483800 (https://phabricator.wikimedia.org/T213563) (owner: 10Thcipriani)
[01:16:06] <wikibugs>	 (03PS7) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366)
[01:17:30] <wikibugs>	 (03CR) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn)
[01:35:25] <icinga-wm>	 PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[01:41:03] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.00 seconds
[01:41:07] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.27 seconds
[01:41:15] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2046 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.27 seconds
[01:41:15] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.80 seconds
[01:41:37] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.45 seconds
[01:41:49] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2060 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.27 seconds
[01:42:03] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.33 seconds
[01:42:07] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2076 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 316.61 seconds
[01:42:17] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.73 seconds
[01:52:17] <icinga-wm>	 RECOVERY - High lag on wdqs2004 is OK: (C)3600 ge (W)1200 ge 1094 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:53:12] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10bd808) **cloudservices1004** is the hot-spare in the cloudservices100[34] cluster supporting the eqiad1-r region of our OpenStack deploy. It should be fine to perform...
[01:54:09] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10bd808)
[01:57:12] <wikibugs>	 (03CR) 10Catrope: [C: 03+1] EditorJourney: Enable data collection for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484289 (https://phabricator.wikimedia.org/T213348) (owner: 10Kosta Harlan)
[01:58:25] <icinga-wm>	 RECOVERY - High lag on wdqs2005 is OK: (C)3600 ge (W)1200 ge 976 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:29:45] <icinga-wm>	 PROBLEM - HP RAID on ms-be1034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.223: Connection reset by peer
[02:34:23] <wikibugs>	 (03PS1) 10BryanDavis: toolforge: kube2proxy: validate requests library version [puppet] - 10https://gerrit.wikimedia.org/r/484609 (https://phabricator.wikimedia.org/T213711)
[02:34:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] toolforge: kube2proxy: validate requests library version [puppet] - 10https://gerrit.wikimedia.org/r/484609 (https://phabricator.wikimedia.org/T213711) (owner: 10BryanDavis)
[02:37:09] <icinga-wm>	 PROBLEM - SSH on ms-be1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:38:13] <icinga-wm>	 RECOVERY - SSH on ms-be1034 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0)
[02:40:03] <wikibugs>	 (03PS2) 10BryanDavis: toolforge: kube2proxy: validate requests library version [puppet] - 10https://gerrit.wikimedia.org/r/484609 (https://phabricator.wikimedia.org/T213711)
[02:47:35] <wikibugs>	 (03PS2) 10Mathew.onipe: elasticsearch_cluster: fix doc for is_green [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346
[02:53:42] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch_cluster: fix doc for is_green [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 (owner: 10Mathew.onipe)
[03:07:47] <icinga-wm>	 PROBLEM - SSH on ms-be1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:10:45] <icinga-wm>	 PROBLEM - HP RAID on ms-be1034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.223: Connection reset by peer
[03:11:25] <icinga-wm>	 RECOVERY - SSH on ms-be1034 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0)
[03:19:14] <wikibugs>	 (03PS3) 10Mathew.onipe: elasticsearch_cluster: fix doc for is_green [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346
[03:19:29] <wikibugs>	 (03CR) 10Mathew.onipe: elasticsearch_cluster: fix doc for is_green (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 (owner: 10Mathew.onipe)
[03:22:11] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10CDanis) It's fine to simply shut down `prometheus1003`.  We have a redundant machine `prometheus1004` which will continue gathering metrics and answering queries.  `pr...
[03:26:19] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.06 seconds
[03:26:33] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.91 seconds
[03:26:43] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.42 seconds
[03:26:59] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 316.46 seconds
[03:27:01] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 316.41 seconds
[03:27:13] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.56 seconds
[03:27:21] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 322.85 seconds
[03:27:25] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 325.56 seconds
[03:29:01] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2076 is OK: OK slave_sql_lag Replication lag: 52.11 seconds
[03:29:17] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2039 is OK: OK slave_sql_lag Replication lag: 5.49 seconds
[03:29:17] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2067 is OK: OK slave_sql_lag Replication lag: 2.58 seconds
[03:29:21] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2053 is OK: OK slave_sql_lag Replication lag: 0.44 seconds
[03:29:23] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2089 is OK: OK slave_sql_lag Replication lag: 0.13 seconds
[03:29:29] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2046 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[03:29:43] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2095 is OK: OK slave_sql_lag Replication lag: 0.33 seconds
[03:30:01] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2060 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[03:30:09] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2087 is OK: OK slave_sql_lag Replication lag: 0.17 seconds
[03:42:24] <wikibugs>	 (03PS4) 10BryanDavis: toolforge: process dynamicproxy access logs [puppet] - 10https://gerrit.wikimedia.org/r/482237 (https://phabricator.wikimedia.org/T87001)
[03:52:35] <wikibugs>	 (03CR) 10BryanDavis: "> LGTM!  Has this been tested in toolforge already via cherry-pick?" [puppet] - 10https://gerrit.wikimedia.org/r/482237 (https://phabricator.wikimedia.org/T87001) (owner: 10BryanDavis)
[04:05:01] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 58.33 seconds
[04:05:05] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2091 is OK: OK slave_sql_lag Replication lag: 52.31 seconds
[04:05:19] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 45.46 seconds
[04:05:25] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 39.25 seconds
[04:05:29] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 37.50 seconds
[04:05:37] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 28.57 seconds
[04:05:49] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2095 is OK: OK slave_sql_lag Replication lag: 20.89 seconds
[04:06:01] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2084 is OK: OK slave_sql_lag Replication lag: 12.57 seconds
[04:12:15] <wikibugs>	 (03PS5) 10BryanDavis: cloud: rewrite spreadcheck.py NPRE check [puppet] - 10https://gerrit.wikimedia.org/r/483606
[04:16:07] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.39 seconds
[04:19:47] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[04:21:45] <wikibugs>	 (03CR) 10BryanDavis: "> The puppet changes seem incomplete but maybe I'm missing something" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483606 (owner: 10BryanDavis)
[04:24:09] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.53 seconds
[04:24:09] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.57 seconds
[04:24:15] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.62 seconds
[04:24:37] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.26 seconds
[04:25:05] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.85 seconds
[04:25:11] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.37 seconds
[04:25:11] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.44 seconds
[04:38:31] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.45 seconds
[04:39:41] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.11 seconds
[04:40:07] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.68 seconds
[04:40:09] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.37 seconds
[04:40:15] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.16 seconds
[04:40:19] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.61 seconds
[04:40:31] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.62 seconds
[04:40:37] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.98 seconds
[04:40:55] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.28 seconds
[05:22:17] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:28:57] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.69 seconds
[05:32:05] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1024 is OK: OK - running: The system is fully operational
[05:42:37] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 58.56 seconds
[05:42:47] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 56.87 seconds
[05:43:11] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 51.56 seconds
[05:43:17] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 50.61 seconds
[05:43:25] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 47.71 seconds
[05:43:27] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 47.44 seconds
[05:43:31] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 47.54 seconds
[05:43:41] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 47.48 seconds
[05:47:30] <wikibugs>	 10Operations, 10CommRel-Specialists-Support (Jan-Mar-2019), 10Goal, 10User-Johan: Community Relations support for the 2018 data center switchover - https://phabricator.wikimedia.org/T199676 (10Johan)
[05:47:33] <wikibugs>	 10Operations, 10CommRel-Specialists-Support (Jan-Mar-2019), 10User-Johan: Lessons learned: Communicating the server switch 2018 - https://phabricator.wikimedia.org/T206649 (10Johan) 05Open→03Resolved https://office.wikimedia.org/wiki/Community_Relations_Specialists/codfw/2018_lessons
[05:48:03] <wikibugs>	 10Operations, 10CommRel-Specialists-Support (Jan-Mar-2019), 10Goal, 10User-Johan: Community Relations support for the 2018 data center switchover - https://phabricator.wikimedia.org/T199676 (10Johan) 05Open→03Resolved
[05:56:19] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2089 is OK: OK slave_sql_lag Replication lag: 0.32 seconds
[05:56:25] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2094 is OK: OK slave_sql_lag Replication lag: 0.29 seconds
[05:56:33] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 0.51 seconds
[05:56:47] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2066 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[05:56:49] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2075 is OK: OK slave_sql_lag Replication lag: 0.18 seconds
[05:57:09] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2084 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[05:57:09] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK slave_sql_lag Replication lag: 0.30 seconds
[05:57:15] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2059 is OK: OK slave_sql_lag Replication lag: 0.03 seconds
[05:57:33] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[06:01:40] <SMalyshev>	 !log depooling wdq2005 and wdqs2006 for T213854
[06:01:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:43] <stashbot>	 T213854: Reload database on wdq2[456] from another server - https://phabricator.wikimedia.org/T213854
[06:04:41] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:04:47] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s8 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:04:49] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:04:49] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:04:55] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:04:57] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:04:57] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:04:59] <icinga-wm>	 RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:04:59] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:05:11] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s8 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:05:15] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:05:17] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:05:27] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:05:27] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:05:35] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:05:39] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:05:39] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:05:39] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table arwiki.echo_notification: Cant find record in echo_notification, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000312, end_log_pos 765943209
[06:05:39] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s6 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes
[06:06:11] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war
[06:06:15] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:06:25] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2006 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused
[06:06:46] <marostegui>	 !log Deploy schema change on db1067 (s1 primary master) - T85757
[06:06:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:48] <stashbot>	 T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[06:07:08] <SMalyshev>	 !log started transfer wdqs2005->2006
[06:07:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:20] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 (10Marostegui) >>! In T213422#4882368, @jcrespo wrote: > es1019 is back up and mgmt is working. Not starting mysql though, until chris confirms everthin...
[06:10:12] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Re-point m3-master to dbproxy1008 [dns] - 10https://gerrit.wikimedia.org/r/484611 (https://phabricator.wikimedia.org/T213865)
[06:11:57] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 47.33 seconds
[06:16:26] <wikibugs>	 10Operations, 10DBA: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) Don't worry, as soon as we arrange a date/time, I will stop it, so we are sure that no lag will happen before the failover. I will leave the screen running and just kill the process so you can...
[06:23:23] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2006 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war
[06:23:35] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999
[06:27:45] <wikibugs>	 10Operations, 10Cloud-VPS, 10Traffic, 10serviceops: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (10Kelson) I'm not sure to fully understand the technical explanation. Is the problem confirmed? If "yes", what is the plan to sol...
[06:28:29] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: Upgrade firmware on db1078 - https://phabricator.wikimedia.org/T209815 (10Marostegui) Chris, can we do this today as this host will be the future s3 primary master?
[06:29:11] <icinga-wm>	 PROBLEM - puppet last run on an-worker1084 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/update-library.R]
[06:30:41] <icinga-wm>	 PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ImageMagick-6/policy.xml]
[06:31:21] <icinga-wm>	 PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:31:55] <icinga-wm>	 PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mwrepl]
[06:33:21] <icinga-wm>	 PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:36:03] <icinga-wm>	 RECOVERY - HP RAID on ms-be1034 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK
[06:37:33] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:43:41] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table wikishared.echo_unread_wikis: Duplicate entry 29911752-enwiki for key echo_unread_wikis_user_wiki, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1069-bin.000312, end_log_pos 795338681
[06:44:53] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:46:29] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 30376.97 seconds
[06:47:13] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 71061.26 seconds
[06:49:31] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.37 seconds
[06:49:47] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.12 seconds
[06:49:51] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.50 seconds
[06:50:05] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.67 seconds
[06:50:09] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.16 seconds
[06:50:15] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.19 seconds
[06:50:25] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 317.27 seconds
[06:50:29] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.13 seconds
[06:54:39] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Promote db1078 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/484612 (https://phabricator.wikimedia.org/T213858)
[06:56:45] <icinga-wm>	 RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:23] <icinga-wm>	 RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:57:59] <icinga-wm>	 RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:00] <wikibugs>	 (03CR) 10Tulsi Bhagat: "Requires `namespaceDupes.php --wiki=zhwikiversity --fix` after deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482261 (https://phabricator.wikimedia.org/T212919) (owner: 10Wangql)
[06:58:11] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.20 seconds
[06:58:47] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[06:59:16] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/484612 (https://phabricator.wikimedia.org/T213858) (owner: 10Marostegui)
[06:59:18] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Put s3 on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484613 (https://phabricator.wikimedia.org/T213858)
[06:59:27] <icinga-wm>	 RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:59:28] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484613 (https://phabricator.wikimedia.org/T213858) (owner: 10Marostegui)
[07:00:25] <icinga-wm>	 RECOVERY - puppet last run on an-worker1084 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:03:32] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Promote db1078 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484614 (https://phabricator.wikimedia.org/T213858)
[07:05:58] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484614 (https://phabricator.wikimedia.org/T213858) (owner: 10Marostegui)
[07:14:27] <marostegui>	 !log Upgrade MySQL on db2050 and db2036
[07:14:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:13] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10Marostegui)
[07:27:02] <marostegui>	 !log Drop table tag_summary from s2 - T212255
[07:27:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:05] <stashbot>	 T212255: Drop tag_summary table - https://phabricator.wikimedia.org/T212255
[07:36:13] <vgutierrez>	 !log powercycling cp1088 - T203194
[07:36:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:17] <stashbot>	 T203194: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194
[07:39:02] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] "That is a good one :)" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/484587 (owner: 10Thcipriani)
[07:39:04] <wikibugs>	 (03CR) 10Hashar: [V: 03+2 C: 03+2] Fix deploy_artifacts.py [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/484587 (owner: 10Thcipriani)
[07:41:37] <icinga-wm>	 RECOVERY - Host cp1088 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[07:41:39] <icinga-wm>	 RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 64 ESP OK
[07:41:41] <icinga-wm>	 RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 64 ESP OK
[07:41:45] <icinga-wm>	 RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 40 ESP OK
[07:41:45] <icinga-wm>	 RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 40 ESP OK
[07:41:45] <icinga-wm>	 RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 40 ESP OK
[07:41:47] <icinga-wm>	 RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 64 ESP OK
[07:41:49] <icinga-wm>	 RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK
[07:41:51] <icinga-wm>	 RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 40 ESP OK
[07:41:53] <icinga-wm>	 RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 40 ESP OK
[07:41:53] <icinga-wm>	 RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 40 ESP OK
[07:41:53] <icinga-wm>	 RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 40 ESP OK
[07:41:53] <icinga-wm>	 RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 40 ESP OK
[07:41:53] <icinga-wm>	 RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 40 ESP OK
[07:41:55] <icinga-wm>	 RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 40 ESP OK
[07:41:55] <icinga-wm>	 RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 40 ESP OK
[07:41:55] <icinga-wm>	 RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 40 ESP OK
[07:42:07] <icinga-wm>	 RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 40 ESP OK
[07:42:13] <icinga-wm>	 RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 40 ESP OK
[07:42:15] <icinga-wm>	 RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 64 ESP OK
[07:42:15] <icinga-wm>	 RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK
[07:42:15] <icinga-wm>	 RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 64 ESP OK
[07:42:15] <icinga-wm>	 RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 64 ESP OK
[07:42:17] <icinga-wm>	 RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 40 ESP OK
[07:42:17] <icinga-wm>	 RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK
[07:42:22] <icinga-wm>	 RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 40 ESP OK
[07:42:25] <icinga-wm>	 RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 40 ESP OK
[07:42:29] <icinga-wm>	 RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 40 ESP OK
[07:42:33] <icinga-wm>	 RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK
[07:42:37] <icinga-wm>	 RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 64 ESP OK
[07:42:37] <icinga-wm>	 RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 40 ESP OK
[07:42:39] <icinga-wm>	 RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 40 ESP OK
[07:42:39] <icinga-wm>	 RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 40 ESP OK
[07:42:41] <icinga-wm>	 RECOVERY - IPsec on cp5006 is OK: Strongswan OK - 40 ESP OK
[07:42:43] <icinga-wm>	 RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 64 ESP OK
[07:42:45] <icinga-wm>	 RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 40 ESP OK
[07:42:45] <icinga-wm>	 RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 40 ESP OK
[07:47:11] <wikibugs>	 (03PS7) 10Wangql: Modifying configuration about Chinese Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482261 (https://phabricator.wikimedia.org/T212919)
[07:51:06] <wikibugs>	 (03CR) 10Wangql: [C: 03+1] "> Patch Set 7: Verified+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482261 (https://phabricator.wikimedia.org/T212919) (owner: 10Wangql)
[07:53:17] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on elastic1032 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] Giuseppe Lavagetto eqiad rack A3 pdu failure.
[07:53:17] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on elastic1033 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] Giuseppe Lavagetto eqiad rack A3 pdu failure.
[07:53:17] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on elastic1034 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] Giuseppe Lavagetto eqiad rack A3 pdu failure.
[07:53:17] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on elastic1035 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] Giuseppe Lavagetto eqiad rack A3 pdu failure.
[07:53:17] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on ganeti1007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Giuseppe Lavagetto eqiad rack A3 pdu failure.
[07:53:18] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on graphite1003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 1 = Critical, Power Supplies = Critical] Giuseppe Lavagetto eqiad rack A3 pdu failure.
[07:53:18] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on kubernetes1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Giuseppe Lavagetto eqiad rack A3 pdu failure.
[07:54:44] <wikibugs>	 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10Vgutierrez) cp1088 has been affected as well after the kernel upgrade
[08:01:57] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 120.76 seconds
[08:07:52] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10MoritzMuehlenhoff)
[08:11:52] <elukey>	 !log drop unneeded tables from the staging db on dbstore1002 according to T212493#4883535
[08:11:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:56] <stashbot>	 T212493: Clean up staging db - https://phabricator.wikimedia.org/T212493
[08:15:19] <marostegui>	 !log Upgrade MySQL on db2043 (s3 codfw master) 
[08:15:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:44] <akosiaris>	 !log depool codfw zotero for helm release cleanups
[08:18:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:05] <elukey>	 !log convert aria tables to innodb on dbstore1002 - T213706
[08:19:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:08] <stashbot>	 T213706: Convert Aria tables to InnoDB on dbstore1002 - https://phabricator.wikimedia.org/T213706
[08:20:45] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: decom radium - https://phabricator.wikimedia.org/T203861 (10MoritzMuehlenhoff)
[08:21:43] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10MoritzMuehlenhoff)
[08:22:07] <wikibugs>	 (03CR) 10星耀晨曦: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482261 (https://phabricator.wikimedia.org/T212919) (owner: 10Wangql)
[08:23:55] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table ruwiki.echo_notification: Cant find record in echo_notification, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000312, end_log_pos 967590890
[08:24:31] <marostegui>	 !log Drop table tag_summary from s4 - T212255
[08:24:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:34] <stashbot>	 T212255: Drop tag_summary table - https://phabricator.wikimedia.org/T212255
[08:25:21] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero install -f zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw]
[08:25:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:23] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero cluster codfw completed
[08:25:23] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero finished
[08:25:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:27] <wikibugs>	 (03CR) 10Muehlenhoff: "That's a misunderstanding; I told you that npm is now in stretch-backports, but the internally managed component/package is nodejs." [puppet] - 10https://gerrit.wikimedia.org/r/483889 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn)
[08:27:11] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - zotero_1969: Servers kubernetes2002.codfw.wmnet, kubernetes2004.codfw.wmnet are marked down but pooled
[08:30:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - zotero_1969: Servers kubernetes2002.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled
[08:30:34] <akosiaris>	 expected ^
[08:30:42] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero install -n production -f zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw]
[08:30:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:44] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero cluster codfw completed
[08:30:44] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero finished
[08:30:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:24] <godog>	 I'll be moving all mw logging to kafka shortly btw
[08:32:32] <godog>	 that's https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/483384
[08:37:15] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 976.09 seconds
[08:37:39] <wikibugs>	 (03PS2) 10Filippo Giunchedi: Default production logging to new logging infrastructure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483384 (https://phabricator.wikimedia.org/T211124)
[08:37:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] Default production logging to new logging infrastructure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483384 (https://phabricator.wikimedia.org/T211124) (owner: 10Filippo Giunchedi)
[08:38:03] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 55.03 seconds
[08:38:15] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 54.14 seconds
[08:38:15] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 54.18 seconds
[08:38:21] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 53.44 seconds
[08:38:27] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 52.76 seconds
[08:38:27] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 51.98 seconds
[08:38:39] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[08:38:51] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 47.70 seconds
[08:38:59] <wikibugs>	 (03Merged) 10jenkins-bot: Default production logging to new logging infrastructure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483384 (https://phabricator.wikimedia.org/T211124) (owner: 10Filippo Giunchedi)
[08:40:16] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero upgrade production -f zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw]
[08:40:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:18] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero cluster codfw completed
[08:40:18] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero finished
[08:40:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:12] <logmsgbot>	 !log filippo@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Default to new logging infrastructure - T211124 (duration: 01m 05s)
[08:42:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:16] <stashbot>	 T211124: Move mediawiki to new logging infrastructure - https://phabricator.wikimedia.org/T211124
[08:42:23] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table wikishared.echo_unread_wikis: Duplicate entry 56881087-eswiki for key echo_unread_wikis_user_wiki, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1069-bin.000312, end_log_pos 972526823
[08:42:49] <akosiaris>	 duplicate key?
[08:43:01] <marostegui>	 dbstore1002 is screwed
[08:43:06] <marostegui>	 it had a crash the last few days
[08:43:11] <marostegui>	 so don't pay too much attention to it
[08:43:17] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy
[08:43:20] <akosiaris>	 ok
[08:43:27] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 259.48 seconds
[08:43:37] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[08:43:51] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy
[08:44:02] <apergos>	 ah ha
[08:44:05] <wikibugs>	 10Operations: Offboard Balazs - https://phabricator.wikimedia.org/T213703 (10MoritzMuehlenhoff) I've removed Balazs from pwstore.
[08:44:10] <marostegui>	 it needs to be replaced asap anyways
[08:44:25] <wikibugs>	 10Operations: Offboard Balazs - https://phabricator.wikimedia.org/T213703 (10MoritzMuehlenhoff)
[08:47:55] <akosiaris>	 !log repool zotero in codfw
[08:47:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:33] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] wmnet: Re-point m3-master to dbproxy1008 [dns] - 10https://gerrit.wikimedia.org/r/484611 (https://phabricator.wikimedia.org/T213865) (owner: 10Marostegui)
[08:51:47] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] wmnet: Re-point m3-master to dbproxy1008 [dns] - 10https://gerrit.wikimedia.org/r/484611 (https://phabricator.wikimedia.org/T213865) (owner: 10Marostegui)
[08:52:55] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.72 seconds
[08:53:01] <akosiaris>	 !log depool zotero eqiad for helm release cleanup
[08:53:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:02] <moritzm>	 !log installing systemd security updates for stretch
[08:53:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:05] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.46 seconds
[08:53:31] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.63 seconds
[08:53:37] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.48 seconds
[08:53:39] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.47 seconds
[08:53:45] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.99 seconds
[08:53:47] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.08 seconds
[08:53:55] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.07 seconds
[08:54:07] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.76 seconds
[08:55:40] <wikibugs>	 (03CR) 10jenkins-bot: Default production logging to new logging infrastructure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483384 (https://phabricator.wikimedia.org/T211124) (owner: 10Filippo Giunchedi)
[08:57:01] <marostegui>	 !log Re-point m3-master from dbproxy1003 to dbproxy1008 - T213865
[08:57:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:04] <stashbot>	 T213865: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865
[08:57:38] <marostegui>	 I have moved m3-master to a different dbproxy, if you notice something strange with phabricator please let me know (T213865)
[08:58:06] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero install -n production -f zotero-values-eqiad.yaml stable/zotero [namespace: zotero, clusters: eqiad]
[08:58:07] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero cluster eqiad completed
[08:58:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:07] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero finished
[08:58:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:53] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - zotero_1969: Servers kubernetes1002.eqiad.wmnet, kubernetes1003.eqiad.wmnet are marked down but pooled
[08:59:01] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - zotero_1969: Servers kubernetes1001.eqiad.wmnet, kubernetes1004.eqiad.wmnet are marked down but pooled
[08:59:14] <akosiaris>	 expected ^ should recover in the next 1min or os
[08:59:16] <akosiaris>	 so*
[09:00:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy
[09:00:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[09:00:32] <godog>	 !log test roll-restart rsyslog on mw hosts in eqiad - T211124 
[09:00:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:34] <stashbot>	 T211124: Move mediawiki to new logging infrastructure - https://phabricator.wikimedia.org/T211124
[09:00:55] <icinga-wm>	 PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:03:19] <icinga-wm>	 RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:06:41] <icinga-wm>	 PROBLEM - High lag on wdqs2006 is CRITICAL: 1.101e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:06:45] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table wikishared.echo_unread_wikis: Duplicate entry 320104-enwiki for key echo_unread_wikis_user_wiki, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1069-bin.000312, end_log_pos 1000755095
[09:07:48] <marostegui>	 i think we need to reimport that table
[09:08:38] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10Marostegui) a:03Marostegui This has been done
[09:08:46] <elukey>	 is it a drop + replicate from scratch?
[09:09:23] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10Marostegui)
[09:09:35] <marostegui>	 we have to stop that replication thread in sync with another host, and mysqldump that specific table
[09:10:39] <dcausse>	 !log T210381: elasticsearch search cluster, creating completion suggester indices on psi&omega elastic instances in eqiad&codfw 
[09:10:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:42] <stashbot>	 T210381: Update mw-config to use the psi&omega elastic clusters - https://phabricator.wikimedia.org/T210381
[09:10:46] <elukey>	 ah oko makes sense
[09:15:19] <icinga-wm>	 PROBLEM - configured eth on ms-be1034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.223: Connection reset by peer
[09:15:47] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be1034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.223: Connection reset by peer
[09:16:16] <marostegui>	 !log Stop replication in sync on dbstore1002:x1 and db2034 - T213670
[09:16:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:19] <stashbot>	 T213670: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670
[09:17:09] <godog>	 !log powercycle ms-be1016 - T213856
[09:17:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:12] <stashbot>	 T213856: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T213856
[09:17:35] <icinga-wm>	 PROBLEM - Disk space on wdqs2006 is CRITICAL: DISK CRITICAL - free space: /srv 53055 MB (3% inode=99%)
[09:17:49] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:18:03] <icinga-wm>	 PROBLEM - Host ms-be1016 is DOWN: PING CRITICAL - Packet loss = 100%
[09:18:19] <icinga-wm>	 PROBLEM - HP RAID on ms-be1034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.223: Connection reset by peer
[09:18:55] <icinga-wm>	 RECOVERY - configured eth on ms-be1034 is OK: OK - interfaces up
[09:20:19] <icinga-wm>	 RECOVERY - swift-container-updater on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[09:20:21] <icinga-wm>	 RECOVERY - Host ms-be1016 is UP: PING WARNING - Packet loss = 73%, RTA = 1.20 ms
[09:20:37] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1016 is OK: OK - running: The system is fully operational
[09:20:37] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be1034 is OK: OK - load average: 66.63, 73.16, 68.29
[09:20:45] <icinga-wm>	 RECOVERY - Disk space on ms-be1016 is OK: DISK OK
[09:20:51] <icinga-wm>	 RECOVERY - MD RAID on ms-be1016 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[09:24:57] <icinga-wm>	 RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[09:25:07] <godog>	 ok ms-be1016 is back, raid controller was unhappy, whereas ms-be1034 seems back in line too
[09:29:34] <marostegui>	 !log Stop s3 actor-migration script in order to allow s3 to catch up and to avoid lag during the failover - T188327 T213858
[09:29:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:38] <stashbot>	 T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327
[09:29:39] <stashbot>	 T213858: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858
[09:31:14] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) @Anomie I have stopped the script as we are most likely going to go ahead with the failover in EU morning (still waiting for the managers to confirm)
[09:32:25] <icinga-wm>	 RECOVERY - Disk space on wdqs2006 is OK: DISK OK
[09:33:45] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2060 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.90 seconds
[09:33:47] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.40 seconds
[09:33:53] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2076 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.92 seconds
[09:34:09] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.33 seconds
[09:34:11] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.17 seconds
[09:34:13] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.25 seconds
[09:34:15] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.37 seconds
[09:34:21] <icinga-wm>	 PROBLEM - WDQS HTTP Port on wdqs2006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time
[09:34:25] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2046 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.40 seconds
[09:34:27] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.31 seconds
[09:34:59] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war
[09:35:05] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2006 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused
[09:37:59] <wikibugs>	 (03PS7) 10ArielGlenn: write header/body/footer of xml gz files as separate streams [dumps] - 10https://gerrit.wikimedia.org/r/484505 (https://phabricator.wikimedia.org/T182572)
[09:38:01] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Depool db1077 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484620 (https://phabricator.wikimedia.org/T213858)
[09:38:03] <wikibugs>	 (03PS1) 10Addshore: BETA wikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484621 (https://phabricator.wikimedia.org/T204031)
[09:38:05] <wikibugs>	 (03PS1) 10Addshore: testwikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484622 (https://phabricator.wikimedia.org/T204031)
[09:38:07] <wikibugs>	 (03PS1) 10Addshore: testwikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484623 (https://phabricator.wikimedia.org/T204031)
[09:38:44] <wikibugs>	 (03PS1) 10Addshore: wikidata: post edit constraint jobs on 1% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484624 (https://phabricator.wikimedia.org/T204031)
[09:39:06] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: Depool db1077 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484620 (https://phabricator.wikimedia.org/T213858) (owner: 10Jcrespo)
[09:39:15] <wikibugs>	 (03PS1) 10Addshore: wikidata: post edit constraint jobs on 5% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484625 (https://phabricator.wikimedia.org/T204031)
[09:39:21] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db1077 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484620 (https://phabricator.wikimedia.org/T213858) (owner: 10Jcrespo)
[09:40:32] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10Marostegui)
[09:40:38] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Depool db1077 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484620 (https://phabricator.wikimedia.org/T213858) (owner: 10Jcrespo)
[09:41:28] <wikibugs>	 (03PS1) 10Addshore: wikidata: post edit constraint jobs on 10% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484629 (https://phabricator.wikimedia.org/T204031)
[09:41:30] <wikibugs>	 (03PS1) 10Addshore: wikidata: post edit constraint jobs on 25% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484630 (https://phabricator.wikimedia.org/T204031)
[09:42:12] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10Marostegui) pc1004 can be (and should be) powered off. That host is ready for decommissioning, I have not powered off myself as I am not sure if Chris is wiping disks...
[09:42:25] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1077 (duration: 00m 52s)
[09:42:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:31] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10Marostegui)
[09:42:36] <wikibugs>	 (03PS2) 10Addshore: wikidata: post edit constraint jobs on 25% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484630 (https://phabricator.wikimedia.org/T204031)
[09:43:44] <wikibugs>	 (03PS1) 10Addshore: wikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484633 (https://phabricator.wikimedia.org/T204031)
[09:44:49] <wikibugs>	 (03PS1) 10Addshore: wikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484635 (https://phabricator.wikimedia.org/T204031)
[09:45:31] <wikibugs>	 (03PS2) 10Addshore: wikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484635 (https://phabricator.wikimedia.org/T204031)
[09:46:34] <addshore>	 jouncebot: next
[09:46:34] <jouncebot>	 In 2 hour(s) and 13 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1200)
[09:46:37] <addshore>	 jouncebot: now
[09:46:37] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 13 minute(s)
[09:47:25] <jynus>	 !log upgrade and restart db1077
[09:47:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:22] <addshore>	 jouncebot: refresh
[09:48:24] <jouncebot>	 I refreshed my knowledge about deployments.
[09:48:26] <addshore>	 jouncebot: next
[09:48:26] <jouncebot>	 In 1 hour(s) and 11 minute(s): WikibaseQualityConstraints post edits jobs (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1100)
[09:48:36] <addshore>	 hmm
[09:49:14] <addshore>	 jouncebot: refresh
[09:49:14] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Depool db1077 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484620 (https://phabricator.wikimedia.org/T213858) (owner: 10Jcrespo)
[09:49:15] <jouncebot>	 I refreshed my knowledge about deployments.
[09:49:17] <addshore>	 jouncebot: next
[09:49:17] <jouncebot>	 In 0 hour(s) and 10 minute(s): WikibaseQualityConstraints post edits jobs (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1000)
[09:52:45] <icinga-wm>	 RECOVERY - WDQS HTTP Port on wdqs2006 is OK: HTTP OK: HTTP/1.1 200 OK - 449 bytes in 0.646 second response time
[09:52:59] <godog>	 !log upgrade controller firmware on ms-be1016 - T213856
[09:53:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:01] <stashbot>	 T213856: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T213856
[09:53:05] <wikibugs>	 (03PS11) 10Mathew.onipe: Elasticsearch failed shard allocation check [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850)
[09:53:05] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s3 on db1124 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1077.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1077.eqiad.wmnet (111 Connection refused)
[09:53:13] <marostegui>	 ^ expected
[09:53:23] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2006 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war
[09:53:29] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999
[09:53:40] <wikibugs>	 (03CR) 10Mathew.onipe: Elasticsearch failed shard allocation check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe)
[09:53:44] <wikibugs>	 (03PS1) 10Jcrespo: Revert "mariadb: Depool db1077 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484640
[09:54:41] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] Elasticsearch failed shard allocation check [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe)
[09:55:22] <wikibugs>	 (03PS1) 10Gehel: This script has been moved to the puppet repository. [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/484641
[09:59:16] <wikibugs>	 (03PS2) 10Jcrespo: Revert "mariadb: Depool db1077 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484640
[09:59:18] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Depool db1123 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484642 (https://phabricator.wikimedia.org/T213858)
[10:00:04] <jouncebot>	 addshore: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for WikibaseQualityConstraints post edits jobs . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1000).
[10:00:04] <jouncebot>	 addshore: A patch you scheduled for WikibaseQualityConstraints post edits jobs is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[10:00:41] <addshore>	 \o
[10:00:47] <addshore>	 starting with beta ...
[10:00:58] <wikibugs>	 10Operations, 10ops-codfw, 10media-storage: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 (10fgiunchedi)
[10:01:13] <wikibugs>	 (03PS2) 10Addshore: BETA wikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484621 (https://phabricator.wikimedia.org/T204031)
[10:01:18] <wikibugs>	 (03CR) 10Addshore: [C: 03+2] BETA wikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484621 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[10:01:27] <wikibugs>	 10Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T213856 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Looks like the raid controller freaked out, a reboot "fixed" it. I've upgraded the firmware too: https://wikitech.wikimedia.org/wiki/Platfor...
[10:01:39] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 775.24 seconds
[10:02:23] <wikibugs>	 (03Merged) 10jenkins-bot: BETA wikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484621 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[10:02:37] <wikibugs>	 (03CR) 10jenkins-bot: BETA wikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484621 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[10:03:12] <wikibugs>	 (03PS2) 10Addshore: testwikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484622 (https://phabricator.wikimedia.org/T204031)
[10:03:54] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: BETA ONLY, [[gerrit:484621]] (duration: 00m 52s)
[10:03:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:05] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s3 on db1124 is OK: OK slave_io_state Slave_IO_Running: Yes
[10:04:15] <wikibugs>	 (03CR) 10Addshore: [C: 03+2] testwikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484622 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[10:05:09] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: Depool db1123 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484642 (https://phabricator.wikimedia.org/T213858) (owner: 10Jcrespo)
[10:05:51] <addshore>	 marostegui: you want to deploy that one? :)
[10:06:01] <wikibugs>	 (03Merged) 10jenkins-bot: testwikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484622 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[10:06:35] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Migrate >=90% of existing Logstash traffic to the logging pipeline - https://phabricator.wikimedia.org/T205851 (10fgiunchedi)
[10:06:37] <wikibugs>	 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10Patch-For-Review: Move mediawiki to new logging infrastructure - https://phabricator.wikimedia.org/T211124 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is completed!
[10:08:19] <wikibugs>	 10Operations, 10Operations-Software-Development, 10Wikidata, 10Wikidata-Query-Service: Create a cookbook to copy data between WDQS servers - https://phabricator.wikimedia.org/T213401 (10Mathew.onipe) a:03Mathew.onipe
[10:08:50] <wikibugs>	 10Operations, 10Operations-Software-Development, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Create a cookbook to copy data between WDQS servers - https://phabricator.wikimedia.org/T213401 (10Mathew.onipe)
[10:13:14] <logmsgbot>	 !log addshore@deploy1001 sync-file aborted: testwikidatawiki, wgWBQualityConstraintsEnableConstraintsCheckJobsRatio 50 T204031 [[gerrit:484621]] (duration: 00m 00s)
[10:13:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:17] <stashbot>	 T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031
[10:13:49] <wikibugs>	 (03PS2) 10Addshore: testwikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484623 (https://phabricator.wikimedia.org/T204031)
[10:14:10] <wikibugs>	 (03CR) 10Addshore: [C: 03+2] testwikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484623 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[10:14:13] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: testwikidatawiki, wgWBQualityConstraintsEnableConstraintsCheckJobsRatio 50 T204031 [[gerrit:484621]] (duration: 00m 52s)
[10:14:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:59] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1337 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[10:15:14] <wikibugs>	 (03Merged) 10jenkins-bot: testwikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484623 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[10:15:36] <wikibugs>	 (03CR) 10jenkins-bot: testwikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484622 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[10:15:38] <wikibugs>	 (03CR) 10jenkins-bot: testwikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484623 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[10:16:11] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1337 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.008 second response time
[10:16:47] <wikibugs>	 (03PS8) 10ArielGlenn: write header/body/footer of xml gz files as separate streams [dumps] - 10https://gerrit.wikimedia.org/r/484505 (https://phabricator.wikimedia.org/T182572)
[10:17:03] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] write header/body/footer of xml gz files as separate streams [dumps] - 10https://gerrit.wikimedia.org/r/484505 (https://phabricator.wikimedia.org/T182572) (owner: 10ArielGlenn)
[10:18:16] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 3 others: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10fgiunchedi)
[10:18:20] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Migrate >=90% of existing Logstash traffic to the logging pipeline - https://phabricator.wikimedia.org/T205851 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is completed with mediawiki logging fully switched to new logging infra:  {F27909234}
[10:18:41] <logmsgbot>	 !log addshore@deploy1001 sync-file aborted: testwikidatawiki, wgWBQualityConstraintsEnableConstraintsCheckJobsRatio 100 T204031 [[gerrit:484621]] (duration: 00m 02s)
[10:18:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:44] <stashbot>	 T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031
[10:18:47] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.06 seconds
[10:19:41] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: testwikidatawiki, wgWBQualityConstraintsEnableConstraintsCheckJobsRatio 100 T204031 [[gerrit:484621]] (duration: 00m 52s)
[10:19:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:46] <elukey>	 !log executed kafka preferred-replica-election on the logging Kafka cluster as attempt to spread load more uniformly
[10:19:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:29] <wikibugs>	 (03PS2) 10Addshore: wikidata: post edit constraint jobs on 1% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484624 (https://phabricator.wikimedia.org/T204031)
[10:23:34] <wikibugs>	 (03CR) 10Addshore: [C: 03+2] wikidata: post edit constraint jobs on 1% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484624 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[10:24:09] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 3 others: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10fgiunchedi)
[10:24:41] <wikibugs>	 (03Merged) 10jenkins-bot: wikidata: post edit constraint jobs on 1% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484624 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[10:25:08] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1077 with low load (duration: 00m 51s)
[10:25:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:04] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 3 others: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolv...
[10:26:34] <jynus>	 several db errors on testwikidata, addshore FYI
[10:26:41] <addshore>	 jynus: y*checking*
[10:26:56] <jynus>	 https://logstash.wikimedia.org/goto/ea0517f74f8810ac0572c307e8552cc3
[10:27:37] <addshore>	 hmm, thats that deadlock *find ticket*
[10:27:53] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[10:28:00] <addshore>	 https://phabricator.wikimedia.org/T205045
[10:28:09] <godog>	 !log restart rsyslog on wezen, tls listener stuck
[10:28:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:15] <wikibugs>	 (03CR) 10jenkins-bot: wikidata: post edit constraint jobs on 1% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484624 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[10:29:09] <addshore>	 we are going to tweak the batch size next week for that
[10:29:09] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 948 days)
[10:29:59] <jynus>	 I don't think it is that
[10:30:14] <jynus>	 that exists, but it is know, etc, mostly on wikidata
[10:31:09] <jynus>	 my pointer was to testwikidata https://logstash.wikimedia.org/goto/70d5095cbec80ff162c811b9b2ce3f58
[10:31:30] <jynus>	 but it was only a heads up as I saw you deploying, and these errors are from mwdebug
[10:31:42] <addshore>	 ack!
[10:32:00] <addshore>	 aah, the first logstash link showed me something different, let me look at these
[10:32:13] <jynus>	 sorry I was unclear
[10:33:25] <addshore>	 interesting, they all came from mwdebug1002
[10:33:28] <jynus>	 (no need to report to me, it was just a friendly "you may have missed those")
[10:33:56] <jynus>	 and I try to be pedantic if that can cause an outage later
[10:33:59] <addshore>	 its definitely me that triggered them, but not sure how or why, I'm pretty sure they are not related to the patches I'm deploying right now
[10:34:02] <addshore>	 thanks for the poke!
[10:35:05] <addshore>	 its is perhaps just because i flicked the "log" option in the mwdebug browser extension, so we get a bunch of logs that we don't normally see in logstash for the requests that I was making
[10:35:27] <addshore>	 aaah yes, they are all DEBUG level :) 
[10:35:34] <jynus>	 cool then
[10:35:39] <addshore>	 thanks!
[10:35:47] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "A couple of minor comments inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 (owner: 10Mathew.onipe)
[10:38:00] <wikibugs>	 (03PS9) 10ArielGlenn: write header/body/footer of xml gz files as separate streams [dumps] - 10https://gerrit.wikimedia.org/r/484505 (https://phabricator.wikimedia.org/T182572)
[10:38:25] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wikidatawiki, wgWBQualityConstraintsEnableConstraintsCheckJobsRatio 1% T204031 [[gerrit:484621]] (duration: 00m 52s)
[10:38:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:28] <stashbot>	 T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031
[10:38:43] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 0.01 seconds
[10:39:50] <marostegui>	 twentyafterfour: We have failed over dbproxy1003 to dbproxy1008 which is phabricator, I still see two connections (they have been there for hours) going thru dbproxy1003, can you restart phabricator? T213865
[10:39:50] <stashbot>	 T213865: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865
[10:40:36] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10Marostegui) @mmodell ` ˜/marostegui 11:39> twentyafterfour: We have failed over dbproxy1003 to dbproxy1008 which is phabricator, I still see two connections (they have been there for ho...
[10:42:45] <wikibugs>	 (03PS1) 10Addshore: wgWBQualityConstraintsEnableConstraintsCheckJobs true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484648
[10:42:56] <wikibugs>	 10Operations: Offboard Balazs - https://phabricator.wikimedia.org/T213703 (10jbond) 05Open→03Resolved
[10:43:08] <wikibugs>	 (03CR) 10Addshore: [C: 03+2] wgWBQualityConstraintsEnableConstraintsCheckJobs true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484648 (owner: 10Addshore)
[10:43:49] <wikibugs>	 (03PS2) 10Addshore: wgWBQualityConstraintsEnableConstraintsCheckJobs true testwd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484648
[10:44:13] <wikibugs>	 (03PS1) 10Addshore: wgWBQualityConstraintsEnableConstraintsCheckJobs true wd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484649
[10:44:32] <wikibugs>	 (03CR) 10Addshore: [C: 03+2] wgWBQualityConstraintsEnableConstraintsCheckJobs true testwd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484648 (owner: 10Addshore)
[10:45:36] <wikibugs>	 (03Merged) 10jenkins-bot: wgWBQualityConstraintsEnableConstraintsCheckJobs true testwd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484648 (owner: 10Addshore)
[10:46:29] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Let public archives be indexed and archived - https://phabricator.wikimedia.org/T90407 (10Nemo_bis) > There was a large discussion on irc spanning wikimedia-mailman and -ops that boils down to no one is comfortable or feels it is good practice to index lists that explic...
[10:48:21] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wgWBQualityConstraintsEnableConstraintsCheckJobs true testwd (duration: 00m 52s)
[10:48:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:07] <wikibugs>	 (03CR) 10Addshore: [C: 03+2] wgWBQualityConstraintsEnableConstraintsCheckJobs true wd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484649 (owner: 10Addshore)
[10:51:43] <wikibugs>	 (03Merged) 10jenkins-bot: wgWBQualityConstraintsEnableConstraintsCheckJobs true wd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484649 (owner: 10Addshore)
[10:52:09] <marostegui>	 twentyafterfour: nevermind my previous comment, I have killed them
[10:52:09] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10Marostegui) >>! In T213865#4883806, @Marostegui wrote: > @mmodell > ` > ˜/marostegui 11:39> twentyafterfour: We have failed over dbproxy1003 to dbproxy1008 which is phabricator, I still...
[10:53:28] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wgWBQualityConstraintsEnableConstraintsCheckJobs true wd (duration: 00m 52s)
[10:53:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:38] <wikibugs>	 (03CR) 10jenkins-bot: wgWBQualityConstraintsEnableConstraintsCheckJobs true testwd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484648 (owner: 10Addshore)
[10:54:40] <wikibugs>	 (03CR) 10jenkins-bot: wgWBQualityConstraintsEnableConstraintsCheckJobs true wd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484649 (owner: 10Addshore)
[10:56:13] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "> Patch Set 1:" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484430 (https://phabricator.wikimedia.org/T207845) (owner: 10Jbond)
[10:57:14] <wikibugs>	 (03PS1) 10Addshore: wgWBQualityConstraintsEnableConstraintsCheckJobs false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484652 (https://phabricator.wikimedia.org/T204031)
[10:57:29] <wikibugs>	 (03CR) 10Addshore: [C: 03+2] wgWBQualityConstraintsEnableConstraintsCheckJobs false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484652 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[10:58:35] <wikibugs>	 (03Merged) 10jenkins-bot: wgWBQualityConstraintsEnableConstraintsCheckJobs false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484652 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[10:59:45] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wgWBQualityConstraintsEnableConstraintsCheckJobs false (duration: 00m 51s)
[10:59:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:52] <addshore>	 !log slot done
[10:59:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:07] <icinga-wm>	 PROBLEM - DPKG on analytics1051 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:00:19] <icinga-wm>	 PROBLEM - DPKG on analytics1068 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:00:58] <wikibugs>	 (03CR) 10Addshore: [C: 04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484625 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[11:02:53] <icinga-wm>	 PROBLEM - puppet last run on analytics1044 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools]
[11:02:57] <fsero>	 !log draining kubernetes1001 for maintenance T213859
[11:02:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:59] <stashbot>	 T213859: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859
[11:03:10] <wikibugs>	 (03PS2) 10Jbond: Ensure debdeploy exits cleanly when called without any arguments [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484430 (https://phabricator.wikimedia.org/T207845)
[11:04:57] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484430 (https://phabricator.wikimedia.org/T207845) (owner: 10Jbond)
[11:07:42] <moritzm>	 ^ analytics/dpkg should recover soon
[11:08:15] <wikibugs>	 (03CR) 10jenkins-bot: wgWBQualityConstraintsEnableConstraintsCheckJobs false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484652 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[11:09:59] <icinga-wm>	 RECOVERY - DPKG on analytics1051 is OK: All packages OK
[11:10:11] <icinga-wm>	 RECOVERY - DPKG on analytics1068 is OK: All packages OK
[11:12:37] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10User-herron: Replace and expand Elasticsearch storage in eqiad and upgrade the cluster from Debian jessie to stretch - https://phabricator.wikimedia.org/T213898 (10fgiunchedi) p:05Triage→03Normal
[11:13:17] <icinga-wm>	 RECOVERY - puppet last run on analytics1044 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:13:48] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10fgiunchedi) p:05Triage→03Normal
[11:15:32] <wikibugs>	 10Operations, 10Wikimedia-Logstash: Deprecate >= 50% of udp2log producers - https://phabricator.wikimedia.org/T205856 (10fgiunchedi)
[11:15:37] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10User-herron: Increase utilization of application logging pipeline (FY2018-2019 Q3 TEC6) - https://phabricator.wikimedia.org/T213157 (10fgiunchedi)
[11:16:08] <wikibugs>	 10Operations, 10Wikimedia-Logstash: Deprecate >= 50% of udp2log producers - https://phabricator.wikimedia.org/T205856 (10fgiunchedi)
[11:16:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] Ensure debdeploy exits cleanly when called without any arguments [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484430 (https://phabricator.wikimedia.org/T207845) (owner: 10Jbond)
[11:16:12] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 3 others: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10fgiunchedi)
[11:16:34] <wikibugs>	 (03PS4) 10Mathew.onipe: elasticsearch_cluster: change is_green() implementation [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346
[11:16:48] <wikibugs>	 (03CR) 10Mathew.onipe: elasticsearch_cluster: change is_green() implementation (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 (owner: 10Mathew.onipe)
[11:17:55] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 3 others: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10fgiunchedi)
[11:17:56] <wikibugs>	 10Operations, 10Wikimedia-Logstash: Investigate approaches to ingest sensitive log producers - https://phabricator.wikimedia.org/T205855 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi One of these approaches will be implemented as part of stretch goals of {T213157}
[11:19:07] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 3 others: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10fgiunchedi)
[11:19:12] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This was completed, there will be more followup while deprecat...
[11:22:16] <wikibugs>	 (03CR) 10Jbond: [V: 03+2] Ensure debdeploy exits cleanly when called without any arguments [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484430 (https://phabricator.wikimedia.org/T207845) (owner: 10Jbond)
[11:22:22] <wikibugs>	 10Operations, 10Wikimedia-Logstash: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (10fgiunchedi)
[11:22:33] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10elukey) Ok for me for the analytics nodes, but I'd need a bit of heads up to properly stop them if possible :)  The Thursday time window proposal is fine for me!
[11:23:21] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] Ensure debdeploy exits cleanly when called without any arguments [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484430 (https://phabricator.wikimedia.org/T207845) (owner: 10Jbond)
[11:23:37] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10User-herron: Increase utilization of application logging pipeline (FY2018-2019 Q3 TEC6) - https://phabricator.wikimedia.org/T213157 (10fgiunchedi)
[11:23:39] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Logstash hardware expansion - https://phabricator.wikimedia.org/T203169 (10fgiunchedi)
[11:24:54] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10User-herron: [stretch] Implement sensitive log access control, onboard 3 sensitive log producers - https://phabricator.wikimedia.org/T213902 (10fgiunchedi) p:05Triage→03Normal
[11:25:31] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] cookbooks.sre.hosts: improve upgrade-and-reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/484422 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans)
[11:28:40] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, very minor nitpick inline." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 (owner: 10Mathew.onipe)
[11:28:56] <wikibugs>	 (03CR) 10Volans: [C: 03+2] cookbooks.sre.hosts: improve upgrade-and-reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/484422 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans)
[11:30:43] <wikibugs>	 (03Merged) 10jenkins-bot: cookbooks.sre.hosts: improve upgrade-and-reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/484422 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans)
[11:32:54] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10User-herron: Increase utilization of application logging pipeline (FY2018-2019 Q3 TEC6) - https://phabricator.wikimedia.org/T213157 (10fgiunchedi)
[11:32:57] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Logstash hardware expansion - https://phabricator.wikimedia.org/T203169 (10fgiunchedi)
[11:33:02] <wikibugs>	 (03PS7) 10Volans: sre.hosts: add varnish upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/480103 (https://phabricator.wikimedia.org/T205886) (owner: 10Ema)
[11:33:37] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 3 others: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10fgiunchedi)
[11:33:39] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Logstash hardware expansion - https://phabricator.wikimedia.org/T203169 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi codfw is done, resolving in favor of {T213898}
[11:43:40] <wikibugs>	 10Operations, 10monitoring, 10Graphite, 10MW-1.27-release (WMF-deploy-2016-04-12_(1.27.0-wmf.21)), 10MW-1.27-release-notes: UDP rcvbuferrors and inerrors on graphite hosts - https://phabricator.wikimedia.org/T101141 (10fgiunchedi)
[11:44:37] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] sre.hosts: add varnish upgrade cookbook (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/480103 (https://phabricator.wikimedia.org/T205886) (owner: 10Ema)
[11:52:09] <wikibugs>	 (03PS5) 10Mathew.onipe: elasticsearch_cluster: change is_green() implementation [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346
[11:52:29] <wikibugs>	 (03CR) 10Mathew.onipe: elasticsearch_cluster: change is_green() implementation (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 (owner: 10Mathew.onipe)
[11:52:50] <addshore>	 \o zeljkof 
[11:53:22] <addshore>	 i have 2 backports in the swat (in 8 mins) im gonna hit +2 on them now, as the CI will likely take a pretty long time :) they will be the last 2 patches to get deployed
[11:53:37] <addshore>	 that is, if your around for swat, otherwise im talking to the wrong person! :D
[11:53:45] <zeljkof>	 addshore: ok
[11:54:18] <addshore>	 [=
[11:57:08] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 (10jcrespo) a:05Cmjohnson→03jcrespo Taking care of it.
[11:59:00] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 (owner: 10Mathew.onipe)
[12:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1200).
[12:00:04] <jouncebot>	 Thiemo_WMDE, davidwbarratt, dcausse, and addshore: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:10] <addshore>	 \o
[12:00:15] <dcausse>	 o/
[12:00:20] <davidwbarratt>	 here!
[12:00:25] * addshore can deploy deploy his 2 patches right at the end [=
[12:00:31] <zeljkof>	 o/
[12:00:51] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 (10Cmjohnson) While the server was down I updated, BIOS, raid firmware and hardware firmware to the latest updates
[12:00:58] <zeljkof>	 Thiemo_WMDE, davidwbarratt, dcausse, and addshore: the last two are deployers, right? the first two? ;)
[12:01:06] <dcausse>	 I can go last, mine will take some time to test
[12:01:14] <zeljkof>	 (I can SWAT today for people that are not deployers)
[12:01:37] <zeljkof>	 ok, dcausse and addshore, I'll let you know when I'm done, so the two of you self-organize
[12:01:40] <davidwbarratt>	 thanks!
[12:01:55] <zeljkof>	 Thiemo_WMDE, davidwbarratt: you're not deployers?
[12:01:58] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10jcrespo) So this is solved?
[12:02:16] <davidwbarratt>	 zeljkof correct!
[12:02:41] <zeljkof>	 davidwbarratt: ok, you're first then, looks like Thiemo_WMDE is not around
[12:02:55] <addshore>	 CFisch_WMDE: ^^
[12:03:02] <zeljkof>	 I'll let you know when your patch is at mwdebug1002, ready for testing, in a few minutes, let me know if you need help with testing there
[12:03:06] <zeljkof>	 (there are docs)
[12:03:28] <CFisch_WMDE>	 zeljkof:
[12:03:31] <davidwbarratt>	 oki eodkie 
[12:03:32] <CFisch_WMDE>	 go for it
[12:03:47] * Thiemo_WMDE is here.
[12:03:49] <davidwbarratt>	 ha, mangled that
[12:05:25] <wikibugs>	 (03PS3) 10Zfilipin: Enable Partial Blocks on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476884 (https://phabricator.wikimedia.org/T210444) (owner: 10Dmaza)
[12:05:36] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: profile::services_proxy: simple local proxying for remote services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) (owner: 10Giuseppe Lavagetto)
[12:06:20] <wikibugs>	 (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476884 (https://phabricator.wikimedia.org/T210444) (owner: 10Dmaza)
[12:07:27] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Partial Blocks on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476884 (https://phabricator.wikimedia.org/T210444) (owner: 10Dmaza)
[12:08:20] <zeljkof>	 davidwbarratt: 476884 is at mwdebug1002, please test and let me know if I can deploy it
[12:08:45] <davidwbarratt>	 testing
[12:10:07] <davidwbarratt>	 looks beautiful
[12:11:01] <zeljkof>	 davidwbarratt: ok to deploy?
[12:11:16] <davidwbarratt>	 zeljkof yes. :) thank you!
[12:12:02] <zeljkof>	 ok, deploying
[12:12:27] <jynus>	 !log upgrade and restart db1095
[12:12:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:52] <logmsgbot>	 !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:476884|Enable Partial Blocks on itwiki (T210444)]] (duration: 00m 53s)
[12:12:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:55] <stashbot>	 T210444: SWAT deploy PartialBlocks on Italian Wikipedia — Jan 16, 2019 - https://phabricator.wikimedia.org/T210444
[12:13:24] <zeljkof>	 davidwbarratt: it's deployed, please test and thanks for deploying with #releng :)
[12:13:42] <davidwbarratt>	 it looks amazing! thanks!
[12:13:42] <zeljkof>	 Thiemo_WMDE: please stand by, your're next
[12:13:50] <Daimona>	 (Gonna test it too, is it on mwdebug or already deployed?)
[12:13:50] <wikibugs>	 (03PS3) 10Zfilipin: Deploy the FileExporter as a beta feature on all Wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484513 (https://phabricator.wikimedia.org/T213425) (owner: 10WMDE-Fisch)
[12:14:00] <zeljkof>	 davidwbarratt: already deployed
[12:14:01] <wikibugs>	 (03CR) 10jenkins-bot: Enable Partial Blocks on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476884 (https://phabricator.wikimedia.org/T210444) (owner: 10Dmaza)
[12:14:12] <zeljkof>	 Daimona: sorry, it was for you :)
[12:14:15] <zeljkof>	 ^
[12:14:20] <Daimona>	 zeljkof Alright, thanks :)
[12:14:31] <davidwbarratt>	 you can test here if you have permissions: https://it.wikipedia.org/wiki/Speciale:Blocca
[12:14:48] <davidwbarratt>	 note... there is a missing translation, but that word "Sitewide" has been in the repo for... over a month. 
[12:15:32] <Thiemo_WMDE>	 ?
[12:15:34] <wikibugs>	 (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484513 (https://phabricator.wikimedia.org/T213425) (owner: 10WMDE-Fisch)
[12:15:46] <CFisch_WMDE>	 \o/
[12:15:56] <Daimona>	 Already tested, seems to work and nothing suspicious on logstash
[12:16:06] <davidwbarratt>	 Daimona YAY! thanks!
[12:16:08] <Daimona>	 Just some messages which need to be translated, I'll do that later
[12:16:15] <zeljkof>	 Thiemo_WMDE: just wanted to make sure you're around, your commit will be deployed to mwdebug soon
[12:16:41] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy the FileExporter as a beta feature on all Wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484513 (https://phabricator.wikimedia.org/T213425) (owner: 10WMDE-Fisch)
[12:16:42] <Reedy>	 Oh swat
[12:17:18] <zeljkof>	 Reedy: you have to say it with desperation in voice ;)
[12:17:18] <Reedy>	 zeljkof: Can you do https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/484523/ please?
[12:17:27] <davidwbarratt>	 Daimona thank you!
[12:17:36] <Hauskatze>	 Reedy: I was asked if we have some estimate as to when nap.wikisource could be created
[12:17:44] <Hauskatze>	 but I couldn't answer
[12:17:51] <zeljkof>	 Reedy: sure, but I'm almost done, you can do it yourself in a few minutes, if you prefer
[12:17:57] <Reedy>	 I don't mind
[12:18:06] <Reedy>	 Hauskatze: mañana
[12:18:17] <Hauskatze>	 Reedy: vale
[12:18:36] <zeljkof>	 Thiemo_WMDE: 484513 is at mwdebug1002, please test and let me know if I can deploy it cc CFisch_WMDE 
[12:18:48] <Hauskatze>	 Reedy: then please let me know if there are missing patches or something, or will you take care of them?
[12:19:09] <zeljkof>	 Reedy: please add your commit to the calendar
[12:19:15] <addshore>	 Reedy: do this? > addshore@deploy1001:/srv/mediawiki-staging/php-1.33.0-wmf.12$ git log HEAD..origin/wmf/1.33.0-wmf.12 :(
[12:19:41] <Reedy>	 Hauskatze: As long as DNS/apache type commits are done so I can JFDI to create the wiki...
[12:19:43] <Thiemo_WMDE>	 zeljkof: Done, confirmed.
[12:19:48] <Reedy>	 Hauskatze: The other one is I think addWiki is broken still
[12:19:48] <Daimona>	 davidwbarratt no prob :)
[12:19:58] <zeljkof>	 Thiemo_WMDE: ok to deploy?
[12:20:07] <Hauskatze>	 Reedy: I think DNS was done and no Apache needed as nap.wikipedia already exists
[12:20:10] <Thiemo_WMDE>	 zeljkof: Ok.
[12:20:18] <zeljkof>	 ok, deploying
[12:20:19] <Hauskatze>	 let me browse the checklist though
[12:20:51] <Reedy>	 addshore: wut?
[12:21:08] <addshore>	 there is what looks like a massive centralnotice commit there that I wasnt expecting?
[12:21:18] <logmsgbot>	 !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:484513|Deploy the FileExporter as a beta feature on all Wikimedia wikis (T213425)]] (duration: 00m 53s)
[12:21:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:21] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 40.72 seconds
[12:21:22] <stashbot>	 T213425: Deploy the FileExporter as a beta feature on all Wikimedia wikis - https://phabricator.wikimedia.org/T213425
[12:21:23] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 35.21 seconds
[12:21:26] <addshore>	 *goes to find it on gerrit*
[12:21:29] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 27.11 seconds
[12:21:37] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 1.16 seconds
[12:21:39] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 0.27 seconds
[12:21:43] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 0.02 seconds
[12:21:53] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[12:21:56] <zeljkof>	 Thiemo_WMDE: it's deployed, please test and thanks for deploying with #releng :)
[12:22:07] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 0.31 seconds
[12:22:11] <addshore>	 AndyRussG: I think the CentralNotice commit I'm seeing is yours?
[12:22:17] <wikibugs>	 (03PS4) 10Reedy: frwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484523 (https://phabricator.wikimedia.org/T213848)
[12:22:28] <Reedy>	 addshore: Is it their autobumping autotracking stuff?
[12:22:32] <zeljkof>	 Reedy, dcausse, addshore: I'm done, go ahead with your patches, please self organize :)
[12:22:40] <addshore>	 zeljkof: thanks
[12:22:43] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] frwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484523 (https://phabricator.wikimedia.org/T213848) (owner: 10Reedy)
[12:22:50] <Thiemo_WMDE>	 zeljkof: Works as expected on live system. Thanks!
[12:23:02] <addshore>	 Reedy: not sure, the commit message is massive.. i wish these commits would stop appearing
[12:23:46] <wikibugs>	 (03Merged) 10jenkins-bot: frwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484523 (https://phabricator.wikimedia.org/T213848) (owner: 10Reedy)
[12:24:04] <addshore>	 Reedy: the only thing I know what to do now, is yell in here about it :P
[12:25:03] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10Marostegui) Yes
[12:25:11] <logmsgbot>	 !log reedy@deploy1001 Synchronized wmf-config/throttle.php: T213848 (duration: 00m 53s)
[12:25:11] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 56.29 seconds
[12:25:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:13] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2091 is OK: OK slave_sql_lag Replication lag: 53.15 seconds
[12:25:13] <stashbot>	 T213848: Requesting temporary lift of IP cap on fr.wikipedia.org - https://phabricator.wikimedia.org/T213848
[12:25:21] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10Marostegui)
[12:25:26] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10Marostegui) 05Open→03Resolved
[12:25:31] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 51.32 seconds
[12:25:31] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 49.06 seconds
[12:25:45] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 44.22 seconds
[12:25:45] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 0.13 seconds
[12:25:45] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2095 is OK: OK slave_sql_lag Replication lag: 44.27 seconds
[12:25:55] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 40.27 seconds
[12:26:09] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2084 is OK: OK slave_sql_lag Replication lag: 37.63 seconds
[12:26:29] <addshore>	 I guess it is tracked in https://phabricator.wikimedia.org/T179536
[12:26:54] <wikibugs>	 (03CR) 10jenkins-bot: Deploy the FileExporter as a beta feature on all Wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484513 (https://phabricator.wikimedia.org/T213425) (owner: 10WMDE-Fisch)
[12:26:56] <wikibugs>	 (03CR) 10jenkins-bot: frwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484523 (https://phabricator.wikimedia.org/T213848) (owner: 10Reedy)
[12:27:08] <dcausse>	 addshore: is it https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/484352 ?
[12:27:32] <addshore>	 dcausse: thats the one
[12:27:41] <Reedy>	 Looks like they've merged it for .13
[12:27:43] <Reedy>	 But forgot about .12
[12:27:58] <dcausse>	 revert perhaps? this is not supposed to be merged without being deployed
[12:28:01] <Reedy>	 Nope
[12:28:09] <Hauskatze>	 well, CN has a curious way to deploy stuff :)
[12:28:10] <addshore>	 well, i guess it is 1/2 deployed
[12:28:10] <Reedy>	 Because if we revert it, we change .13 too
[12:28:29] <addshore>	 I don't particularly want to deploy it for .12 myself
[12:28:39] <icinga-wm>	 PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[12:28:41] <Reedy>	 don't update the submodule then?
[12:28:50] <Reedy>	 and/or don't sync the whole extensions tree
[12:29:17] <addshore>	 well, when I do my "git rebase" im just gonna end up hiding the change right?
[12:29:28] <dcausse>	 yes, we'll be out of sync, someone will have to take care of it
[12:29:33] <addshore>	 and then if someone ends up doing a full sync they may be syncing it without knowing
[12:29:35] <Reedy>	 Do you actually need to rebase anything?
[12:29:41] <Reedy>	 Well, no they won't
[12:29:49] <Reedy>	 unless you do git submodule update extensions/CentralNotice
[12:29:52] <dcausse>	 the submodule should be visible?
[12:29:53] <Reedy>	 The staged code isn't going to change
[12:31:25] <addshore>	 Reedy: well, i still need to get my change (which is currently on top of the CN one) into the actual tree
[12:31:31] <Reedy>	 Right
[12:31:50] <Reedy>	 don't update the CN git submodule
[12:31:51] <icinga-wm>	 PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[12:31:53] <Reedy>	 It's easy to avoid
[12:32:04] <addshore>	 do tell :)
[12:32:19] <addshore>	 i generally avoid doing things well deploying unless they are written down somewhere ;)
[12:32:21] <icinga-wm>	 RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[12:32:24] <Reedy>	 don't run git submodule update --init --recursive
[12:32:32] <Reedy>	 just do git submodule update extensions/Whichever
[12:32:38] <addshore>	 thats all i normally do
[12:32:55] <addshore>	 but in the current state, that won't even update my submodule right? WikibaseQualityConstraints ?
[12:33:09] <addshore>	 unless i do something with my currently only fetched commit
[12:33:48] <addshore>	 or am I misunderstanding that and the submodule update would actually update it? (even without the rebase or something else)
[12:33:57] <Reedy>	 If you explicitly tell it to update your submodule... after git pull/fetch/rebase/whatever
[12:33:59] <Reedy>	 It'll update it
[12:34:02] <Reedy>	 It won't update CN
[12:34:11] <dcausse>	 addshore: just git rebase, the CN patch will be seen in git status because you'll just run git submodule update WikibaseQualityConstraints 
[12:34:33] <wikibugs>	 (03PS1) 10Jbond: update changelog and add gitignore file [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484661
[12:35:07] <addshore>	 so, after the fetch Reedy the submodule update did not update the code
[12:35:14] <Reedy>	 it won't
[12:35:19] <Reedy>	 because your commit isn't there
[12:35:29] <Reedy>	 fetch only does magic things in the background
[12:35:33] <Reedy>	 it doesn't change local HEAD
[12:36:23] <wikibugs>	 (03CR) 10Muehlenhoff: update changelog and add gitignore file (031 comment) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484661 (owner: 10Jbond)
[12:36:47] <icinga-wm>	 RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[12:37:05] <addshore>	 dcausse: one
[12:37:36] <dcausse>	 ?
[12:37:47] <Reedy>	 	modified:   extensions/CentralNotice (new commits)
[12:37:49] <addshore>	 "just git rebase, the CN patch will be seen in git status because you'll just run git submodule update WikibaseQualityConstraints"
[12:37:53] <Reedy>	 Right
[12:38:05] <addshore>	 right, going to sync them
[12:38:07] <Reedy>	 with git status showing ^ people know something isn't right and shouldn't touch it
[12:38:49] <wikibugs>	 (03PS2) 10Jbond: update changelog and add gitignore file [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484661
[12:39:09] <addshore>	 Reedy: ack
[12:39:23] <addshore>	 urgf, i hate that silly CN thing
[12:39:50] <wikibugs>	 (03PS3) 10Jbond: update changelog and add gitignore file [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484661
[12:39:56] <AndyRussG>	 addshore Reedy sorry!!!
[12:39:58] <logmsgbot>	 !log addshore@deploy1001 Synchronized php-1.33.0-wmf.13/extensions/WikibaseQualityConstraints: [[gerrit:484654]] T204031 T204022 Fix constraintsRunCheck Job class & test (duration: 00m 57s)
[12:40:01] <Reedy>	 Aha
[12:40:01] <Hauskatze>	 I guess we should ask them to use standard wmf branches
[12:40:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:02] <stashbot>	 T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031
[12:40:03] <stashbot>	 T204022: Add functionality to run QualityConstraint checks on an entity after every edit - https://phabricator.wikimedia.org/T204022
[12:40:08] <AndyRussG>	 Is this the train?
[12:40:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] update changelog and add gitignore file [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484661 (owner: 10Jbond)
[12:40:09] <Reedy>	 Hauskatze: They don't for various reasons
[12:40:20] <Hauskatze>	 Reedy: I know, but it causes issues right?
[12:40:28] <Reedy>	 AndyRussG: the bump went out with the train (.13), which is fine
[12:40:36] <Reedy>	 The problem is .12 wasn't deployed too
[12:40:41] <Reedy>	 I just filed https://phabricator.wikimedia.org/T213915
[12:40:44] <AndyRussG>	 I'm in crazy kid school morning land
[12:40:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] update changelog and add gitignore file [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484661 (owner: 10Jbond)
[12:40:57] <Reedy>	 Hauskatze: But they don't want the branches for deployment tracking master
[12:40:59] <logmsgbot>	 !log addshore@deploy1001 Synchronized php-1.33.0-wmf.12/extensions/WikibaseQualityConstraints: [[gerrit:484654]] T204031 T204022 Fix constraintsRunCheck Job class & test (duration: 00m 54s)
[12:41:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:07] <addshore>	 right, thats my 2 done!
[12:41:19] <dcausse>	 Reedy: you're done, can I go next?
[12:41:19] <Reedy>	 AndyRussG: No major rush, but obviously we don't want to just deploy it to a different MW branch unguided
[12:41:26] <Reedy>	 dcausse: Yeah, sure
[12:41:34] <dcausse>	 swating my changes
[12:42:24] <AndyRussG>	 Well we should get such branches, we'll work on better deploy sooooon
[12:42:32] <wikibugs>	 (03PS23) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381)
[12:42:34] <wikibugs>	 (03PS25) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381)
[12:43:15] <AndyRussG>	 Reedy ok thx! Back at the keyboard pretty sooon
[12:44:07] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[12:45:10] <wikibugs>	 (03Merged) 10jenkins-bot: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[12:48:45] <icinga-wm>	 PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:49:07] <dcausse>	 sigh...
[12:49:27] <icinga-wm>	 RECOVERY - Memory correctable errors -EDAC- on thumbor1004 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops
[12:49:35] <icinga-wm>	 PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:49:37] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:49:42] <wikibugs>	 (03PS9) 10MarcoAurelio: mediawiki: Stop logging each run of purge_abusefilter.pp [puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591)
[12:49:53] <icinga-wm>	 RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 75754 bytes in 3.304 second response time
[12:52:03] <icinga-wm>	 RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 623 bytes in 9.346 second response time
[12:52:03] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 624 bytes in 7.599 second response time
[12:52:36] <wikibugs>	 (03CR) 10jenkins-bot: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[12:54:35] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) Just to confirm:  Date: Thursday 17th January Time: 07:00 AM UTC - 07:30 AM UTC (we expect not to use the full 30 minutes window)  **Impact: All those wikis will go read-o...
[12:54:47] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui)
[12:57:45] <jynus>	 dcausse: what is the status of swat? We have some pending maintenance that cannot wait much (blocking emergency master switch=
[12:58:08] <dcausse>	 jynus: ok
[12:58:15] <dcausse>	 I'm not done yet I need to revert
[12:58:19] <jynus>	 we don't need exclusivity
[12:58:27] <jynus>	 but we need to keep deploying stuff
[12:58:43] <dcausse>	 jynus: it's only db-*.php files?
[12:58:46] <jynus>	 yes
[12:58:50] <dcausse>	 ok
[12:58:51] <jynus>	 as usual :-)
[12:59:03] <dcausse>	 please go ahead
[12:59:06] <jynus>	 do I have your permission to do that?
[12:59:07] <jynus>	 thanks!
[12:59:12] <jynus>	 normally I wait
[12:59:16] <dcausse>	 I'll send revert patch soon, just debugging a bit more on mwdebug1002
[12:59:20] <jynus>	 but we are a bit on a schedule
[12:59:28] <dcausse>	 np! I understand
[12:59:29] <jynus>	 marostegui: ^
[12:59:44] <jynus>	 lets repool db1077
[12:59:51] <jynus>	 fully, then depool 78
[12:59:54] <marostegui>	 sounds good
[12:59:58] <jynus>	 and db1123 for later
[13:00:01] <marostegui>	 great
[13:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1300)
[13:00:38] <jynus>	 prepare the patch for 78 and all the other work while I do the repool
[13:01:09] <marostegui>	 oki!
[13:01:27] <jynus>	 it is a relatively large shift of load, so let's do it slowly
[13:01:42] <jynus>	 20K QPS from one host to other
[13:02:21] <wikibugs>	 (03PS3) 10Jcrespo: Revert "mariadb: Depool db1077 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484640
[13:02:53] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484663 (https://phabricator.wikimedia.org/T209815)
[13:03:03] <jynus>	 I woudl suggest to increas db1123 load
[13:03:43] <wikibugs>	 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10BBlack) >>! In T203194#4880054, @Vgutierrez wrote: > on the Dell community forum there is a [[ https://www.dell.com/community/PowerEdge-Hardware-General/Critical-netwo...
[13:03:45] <marostegui>	 let's give it 250?
[13:03:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484663 (https://phabricator.wikimedia.org/T209815) (owner: 10Marostegui)
[13:03:53] <jynus>	 ok
[13:04:08] <jynus>	 but wait to rebase on top of my revert
[13:04:12] <marostegui>	 yep, ofc
[13:04:15] <icinga-wm>	 RECOVERY - HP RAID on ms-be1034 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK
[13:04:19] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484663 (https://phabricator.wikimedia.org/T209815)
[13:04:22] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Depool db1077 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484640 (owner: 10Jcrespo)
[13:04:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484663 (https://phabricator.wikimedia.org/T209815) (owner: 10Marostegui)
[13:05:55] <wikibugs>	 (03CR) 10jenkins-bot: Revert "mariadb: Depool db1077 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484640 (owner: 10Jcrespo)
[13:06:08] <jynus>	 you will need to rebase manually
[13:06:17] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1077 fully (duration: 00m 52s)
[13:06:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:41] <marostegui>	 yeah, I was preparing a different patchset, which is easier than fixing the conflicts manually XDDD (/me being lazy)
[13:07:09] <jynus>	 https://xkcd.com/1597/ ?
[13:07:13] <jynus>	 XD
[13:07:16] <marostegui>	 exactly XDDDD
[13:07:56] <wikibugs>	 (03PS1) 10DCausse: Revert "[cirrus] Start using replica group settings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484665
[13:08:05] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.87 seconds
[13:08:11] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.65 seconds
[13:08:15] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.54 seconds
[13:08:17] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.35 seconds
[13:08:27] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.28 seconds
[13:08:37] <wikibugs>	 (03Abandoned) 10Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484663 (https://phabricator.wikimedia.org/T209815) (owner: 10Marostegui)
[13:08:39] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.73 seconds
[13:08:41] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.11 seconds
[13:08:58] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484666 (https://phabricator.wikimedia.org/T209815)
[13:08:59] <marostegui>	 jynus: ^
[13:09:01] <wikibugs>	 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10BBlack) See also this email thread where Michael Chan (broadcom driver dev) asks for firmware level output, sees the same numbers we have on cp1088, and tells them to...
[13:09:07] <jynus>	 looking
[13:09:09] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.63 seconds
[13:09:11] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.21 seconds
[13:09:42] <jynus>	 you need to shift the rc and log traffic
[13:09:54] <marostegui>	 what do you mean?
[13:09:55] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "IRC" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484666 (https://phabricator.wikimedia.org/T209815) (owner: 10Marostegui)
[13:10:11] <dcausse>	 jynus, marostegui: I need to rebase deploy1001 with this https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/484665, lemme when it's a good time
[13:10:12] <jynus>	 doesn't it have section loads?
[13:10:23] <marostegui>	 jynus: db1078 no
[13:10:24] <jynus>	 dcausse: go on, we are reviewing
[13:10:27] <dcausse>	 ok
[13:10:49] <jynus>	 marostegui: let me double check to see what I changed before was right too
[13:10:55] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] "SWAT, revert previous patch (testing failed on mwdebug1002)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484665 (owner: 10DCausse)
[13:11:03] <marostegui>	 jynus: db1078 has never had rc or vslow traffic
[13:11:27] <jynus>	 oh, I see
[13:11:38] <jynus>	 it had only during my maintenance
[13:11:45] <jynus>	 but it is now ok, sorry
[13:11:45] <marostegui>	 :)
[13:12:02] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "[cirrus] Start using replica group settings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484665 (owner: 10DCausse)
[13:12:06] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "Sorry" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484666 (https://phabricator.wikimedia.org/T209815) (owner: 10Marostegui)
[13:12:09] <marostegui>	 thanks!
[13:12:16] <marostegui>	 dcausse: let me know when you are done, so I can go :)
[13:12:24] <jynus>	 as I said, and you told me so, I prefer to be pedantic
[13:12:33] <dcausse>	 !log eu SWAT done
[13:12:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:39] <marostegui>	 jynus: and i like it! :)
[13:12:40] <jynus>	 if I don't have it 100% clear
[13:12:41] <dcausse>	 marostegui: all done
[13:12:44] <marostegui>	 thanks!
[13:12:47] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484666 (https://phabricator.wikimedia.org/T209815) (owner: 10Marostegui)
[13:12:55] <jynus>	 dcausse: thanks and sorry for the pressure
[13:13:04] <dcausse>	 np!
[13:13:50] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484666 (https://phabricator.wikimedia.org/T209815) (owner: 10Marostegui)
[13:13:56] <jynus>	 I will reserve the windown on deployments
[13:14:13] <marostegui>	 thanks
[13:15:00] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1078 T209815 (duration: 00m 52s)
[13:15:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:05] <stashbot>	 T209815: Upgrade firmware on db1078 - https://phabricator.wikimedia.org/T209815
[13:15:08] <marostegui>	 !log Stop MySQL on db1078 and power it off for firmware update - T209815
[13:15:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:39] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: apt: repository: trust also the source repo [puppet] - 10https://gerrit.wikimedia.org/r/483140
[13:16:55] <arturo>	 moritzm: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/483140/ regarding this, good to merge?
[13:17:58] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Increase utilization of application logging pipeline (FY2018-2019 Q3 TEC6) - https://phabricator.wikimedia.org/T213157 (10fgiunchedi)
[13:18:05] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: WIP: Remove externalIP settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/484670
[13:18:50] <wikibugs>	 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal) - https://phabricator.wikimedia.org/T213288 (10fgiunchedi)
[13:18:52] <wikibugs>	 (03CR) 10jenkins-bot: Revert "[cirrus] Start using replica group settings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484665 (owner: 10DCausse)
[13:18:55] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484666 (https://phabricator.wikimedia.org/T209815) (owner: 10Marostegui)
[13:20:04] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.71 seconds
[13:24:02] <moritzm>	 arturo: looking
[13:26:41] <wikibugs>	 (03CR) 10Muehlenhoff: apt: repository: trust also the source repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483140 (owner: 10Arturo Borrero Gonzalez)
[13:28:14] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.69 seconds
[13:28:18] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s8 on db2081 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.28 seconds
[13:28:34] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.57 seconds
[13:28:38] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.37 seconds
[13:28:39] <wikibugs>	 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: [stretch] Investigate distributed and long term storage solutions for Prometheus - https://phabricator.wikimedia.org/T213918 (10fgiunchedi) p:05Triage→03Normal
[13:28:40] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.23 seconds
[13:28:44] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s8 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.97 seconds
[13:28:48] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s8 on db2080 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.62 seconds
[13:28:56] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.60 seconds
[13:28:56] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.67 seconds
[13:28:56] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.71 seconds
[13:28:56] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s8 on db2083 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.21 seconds
[13:29:08] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s8 on db2079 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.24 seconds
[13:29:11] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: services: sync aptly repo [puppet] - 10https://gerrit.wikimedia.org/r/484671 (https://phabricator.wikimedia.org/T213917)
[13:30:14] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: services: sync aptly repo [puppet] - 10https://gerrit.wikimedia.org/r/484671 (https://phabricator.wikimedia.org/T213917) (owner: 10Arturo Borrero Gonzalez)
[13:31:42] <wikibugs>	 (03CR) 10Gehel: [V: 03+2 C: 03+2] This script has been moved to the puppet repository. [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/484641 (owner: 10Gehel)
[13:34:09] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Expand modern metrics infrastructure coverage (2018-19 Q2 goal) - https://phabricator.wikimedia.org/T205862 (10fgiunchedi)
[13:36:00] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s8 on db2085 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 432.90 seconds
[13:37:28] <wikibugs>	 (03PS12) 10Gehel: Elasticsearch failed shard allocation check [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe)
[13:40:09] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] Elasticsearch failed shard allocation check [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe)
[13:43:32] <wikibugs>	 (03CR) 10Gehel: "almost good! one missing piece: maps1003 is still tagged as jessie in modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200" [puppet] - 10https://gerrit.wikimedia.org/r/483798 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe)
[13:43:58] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s8 on db2082 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 546.24 seconds
[13:44:13] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Repool es1019 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484673 (https://phabricator.wikimedia.org/T213422)
[13:45:54] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s8 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 573.31 seconds
[13:47:16] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: Repool es1019 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484673 (https://phabricator.wikimedia.org/T213422) (owner: 10Jcrespo)
[13:48:11] <wikibugs>	 10Operations, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Create Icinga check for failed shard allocation - https://phabricator.wikimedia.org/T212850 (10Gehel) This check has been deployed for the main cirrus clusters (eqiad+codfw).  We still need to add it for :  * psi / omega...
[13:49:13] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s8 on db2045 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 616.83 seconds
[13:53:52] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb: Repool es1019 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484673 (https://phabricator.wikimedia.org/T213422) (owner: 10Jcrespo)
[13:54:57] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Repool es1019 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484673 (https://phabricator.wikimedia.org/T213422) (owner: 10Jcrespo)
[13:55:20] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Promote db1078 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/484612 (https://phabricator.wikimedia.org/T213858)
[13:55:40] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad.php: Put s3 on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484613 (https://phabricator.wikimedia.org/T213858)
[13:56:11] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2087 is OK: OK slave_sql_lag Replication lag: 59.53 seconds
[13:56:27] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2053 is OK: OK slave_sql_lag Replication lag: 56.06 seconds
[13:56:29] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2089 is OK: OK slave_sql_lag Replication lag: 55.56 seconds
[13:56:33] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2039 is OK: OK slave_sql_lag Replication lag: 55.20 seconds
[13:56:43] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2095 is OK: OK slave_sql_lag Replication lag: 53.48 seconds
[13:56:45] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2067 is OK: OK slave_sql_lag Replication lag: 52.19 seconds
[13:56:49] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2046 is OK: OK slave_sql_lag Replication lag: 50.51 seconds
[13:57:03] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2076 is OK: OK slave_sql_lag Replication lag: 50.59 seconds
[13:57:03] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2060 is OK: OK slave_sql_lag Replication lag: 50.71 seconds
[13:57:31] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 57.43 seconds
[13:57:37] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 39.60 seconds
[13:57:37] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 39.69 seconds
[13:57:41] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 28.47 seconds
[13:57:45] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 18.88 seconds
[13:57:51] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 7.30 seconds
[13:57:55] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 0.23 seconds
[13:57:57] <wikibugs>	 (03CR) 10Jcrespo: "Comments" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484613 (https://phabricator.wikimedia.org/T213858) (owner: 10Marostegui)
[13:58:05] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[13:58:10] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Repool es1019 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484673 (https://phabricator.wikimedia.org/T213422) (owner: 10Jcrespo)
[13:58:37] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[13:58:48] <onimisionipe>	 godog, herron : https://gerrit.wikimedia.org/r/c/operations/puppet/+/482297/ do you think this check will be useful for logstash? I can prepare the patch to enable it
[14:00:01] <wikibugs>	 (03PS3) 10Marostegui: db-eqiad.php: Set s3 to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484613 (https://phabricator.wikimedia.org/T213858)
[14:00:04] <jouncebot>	 Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1400)
[14:01:01] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1030, Errmsg: Error Got error 22 Invalid argument from storage engine TokuDB on query. Default database: outreachwiki. [Query snipped]
[14:01:13] <marostegui>	 elukey: ^
[14:01:26] <elukey>	 buuuuu
[14:01:45] <elukey>	 that is not related to my alters right?
[14:02:04] <marostegui>	 I don't think so
[14:02:16] <marostegui>	 I cannot believe how much time we are spending with host host lately :(
[14:02:34] <elukey>	 aren't you happy that you can work with me so much!??
[14:02:39] <marostegui>	 XDDD
[14:02:39] * elukey runs away
[14:03:03] <marostegui>	 I am going to conver that table to innodb
[14:04:04] <marostegui>	 great, it is failing for all the tables on that wiki
[14:04:09] <marostegui>	 I am glad it is small
[14:04:13] <wikibugs>	 (03PS3) 10Mathew.onipe: maps: migrate maps1003 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/483798 (https://phabricator.wikimedia.org/T198622)
[14:04:14] <marostegui>	 I will fully move it to innodb
[14:05:01] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1019 with low load (duration: 00m 52s)
[14:05:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:47] <wikibugs>	 (03PS2) 10Jcrespo: Revert "mariadb: Depool es1019 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483700
[14:10:24] <marostegui>	 elukey: I have fixed dbstore1002 
[14:10:27] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[14:10:35] <elukey>	 thanks!
[14:16:04] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Depool es1019 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483700 (owner: 10Jcrespo)
[14:17:27] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mariadb: Depool es1019 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483700 (owner: 10Jcrespo)
[14:18:18] <godog>	 onimisionipe: yes please! definitely sounds useful for logstash too
[14:19:03] <onimisionipe>	 godog: alright! gimmie some min
[14:19:30] <wikibugs>	 (03PS10) 10Ottomata: [WIP] Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247)
[14:19:50] <wikibugs>	 (03PS11) 10Ottomata: [WIP] Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247)
[14:20:48] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1019 fully (duration: 00m 52s)
[14:20:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:00] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10fgiunchedi)
[14:21:02] <wikibugs>	 10Operations, 10monitoring: Adapt Kafka dashboards to use metrics from prometheus-node-exporter - https://phabricator.wikimedia.org/T207041 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Done, I've removed or replaced `servers` in the Kafka dashboard! "Kafka (graphite)" still has some but that's expected
[14:24:16] <wikibugs>	 (03CR) 10jenkins-bot: Revert "mariadb: Depool es1019 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483700 (owner: 10Jcrespo)
[14:29:09] <jynus>	 !log stop upgrade db1124 (this may have temp. lag on labsdb hosts for s1, s3, s5, s8)
[14:29:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:19] <logmsgbot>	 !log otto@deploy1001 Started deploy [analytics/superset/deploy@UNKNOWN]: attempt to deploy 0.26.3-wikimedia1
[14:35:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:05] <icinga-wm>	 PROBLEM - puppet last run on mw1290 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools]
[14:49:43] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.01 seconds
[14:49:51] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.85 seconds
[14:49:55] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2046 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.72 seconds
[14:50:09] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2076 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.02 seconds
[14:50:11] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2060 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.93 seconds
[14:50:14] <wikibugs>	 (03PS1) 10Mathew.onipe: icinga: enable check for psi and omega cluster [puppet] - 10https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850)
[14:50:23] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.55 seconds
[14:50:41] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 320.96 seconds
[14:50:43] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.88 seconds
[14:50:49] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] icinga: enable check for psi and omega cluster [puppet] - 10https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe)
[14:50:49] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 323.89 seconds
[14:52:03] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel data transfer in progress
[14:52:03] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel data transfer in progress
[14:53:01] <wikibugs>	 (03PS2) 10Mathew.onipe: icinga: enable check for psi and omega cluster [puppet] - 10https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850)
[14:56:19] <jynus>	 !log stop upgrade db1125 (this may cause temp. lag on labsdb hosts for s7, s6, s4, s2)
[14:56:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:15] <wikibugs>	 10Operations: Reset Wikitech 2FA access for Matthias_Geisler_WMDE - https://phabricator.wikimedia.org/T213928 (10WMDE-leszek)
[15:02:09] <addshore>	 Reedy: re, https://phabricator.wikimedia.org/T213928 if I verify via a call I can just run the maint script right?
[15:03:06] <wikibugs>	 10Operations: Reset Wikitech 2FA access for Matthias_Geisler_WMDE - https://phabricator.wikimedia.org/T213928 (10Aklapper) > If there is a need for @Matthias_Geisler_WMDE to confirm his identity, please suggest the preferred way to do it.  https://wikitech.wikimedia.org/wiki/Password_reset#Reset_two_factor_authe...
[15:04:27] <Reedy>	 addshore: Pretty much. Or any other way you can comfortably confirm who they are
[15:05:37] <wikibugs>	 10Operations: Reset Wikitech 2FA access for Matthias_Geisler_WMDE - https://phabricator.wikimedia.org/T213928 (10Addshore) > If you recognize them, have a face-to-face or in a video chat. > If someone on WMF staff recognizes them, have a three-way video chat where a staffmember vouches. > Have the user write a r...
[15:06:13] <wikibugs>	 10Operations, 10User-Addshore: Reset Wikitech 2FA access for Matthias_Geisler_WMDE - https://phabricator.wikimedia.org/T213928 (10CDanis) a:03Addshore
[15:06:25] <cdanis>	 addshore: thanks!  assigning it to you just to get it off the ops clinic dashboard :)
[15:06:30] <addshore>	 ack!
[15:08:30] <wikibugs>	 10Operations, 10User-Addshore: Reset Wikitech 2FA access for Matthias_Geisler_WMDE - https://phabricator.wikimedia.org/T213928 (10Addshore) Confirmed in a call, will go and reset this now
[15:09:18] <wikibugs>	 (03PS1) 10Mathew.onipe: icinga: enable check for logstash [puppet] - 10https://gerrit.wikimedia.org/r/484685 (https://phabricator.wikimedia.org/T212850)
[15:10:44] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726 (10jcrespo) 05Open→03Stalled stalling, no errors so far, but I doubt this is the last time we hear abut this. Backups are on dbstore1001 just in case.
[15:10:52] <wikibugs>	 10Operations, 10User-Addshore: Reset Wikitech 2FA access for Matthias_Geisler_WMDE - https://phabricator.wikimedia.org/T213928 (10Addshore)
[15:11:30] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121 (10jcrespo)
[15:11:35] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 (10jcrespo) 05Open→03Resolved a:05jcrespo→03Cmjohnson es1019 is back into service.
[15:12:03] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Upgrade firmware on db1078 - https://phabricator.wikimedia.org/T209815 (10Cmjohnson) I ran the Service Pack on db1078, all firmware is up to date including BIOS and raid controller.  The server is currently powered off
[15:12:05] <addshore>	 !log addshore@mwmaint1002:~$ mwscript extensions/OATHAuth/maintenance/disableOATHAuthForUser.php --wiki=labswiki Matthias_Geisler // T213928
[15:12:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:08] <stashbot>	 T213928: Reset Wikitech 2FA access for Matthias_Geisler_WMDE  - https://phabricator.wikimedia.org/T213928
[15:12:26] <wikibugs>	 10Operations, 10User-Addshore: Reset Wikitech 2FA access for Matthias_Geisler_WMDE - https://phabricator.wikimedia.org/T213928 (10Addshore) You should now be able to login and setup 2fa again
[15:13:07] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.12 seconds
[15:13:09] <icinga-wm>	 RECOVERY - puppet last run on mw1290 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[15:13:09] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.82 seconds
[15:13:21] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.69 seconds
[15:13:51] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484686
[15:13:53] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.58 seconds
[15:13:57] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.26 seconds
[15:14:01] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.63 seconds
[15:14:01] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.13 seconds
[15:14:07] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.07 seconds
[15:14:13] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.96 seconds
[15:14:27] <wikibugs>	 (03PS2) 10Mathew.onipe: icinga: enable check for logstash [puppet] - 10https://gerrit.wikimedia.org/r/484685 (https://phabricator.wikimedia.org/T212850)
[15:14:49] <wikibugs>	 (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484686
[15:20:30] <wikibugs>	 (03PS3) 10Mathew.onipe: icinga: enable check for psi and omega cluster [puppet] - 10https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850)
[15:20:35] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational
[15:22:09] <onimisionipe>	 godog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/484685
[15:23:09] <anomie>	 marostegui: Is it safe for me to restart the actor migration on s3 now?
[15:23:40] <marostegui>	 anomie: no, we are doing the failover tomorrow, so it needs to be stopped till tomorrow, sorry about that
[15:24:23] <wikibugs>	 10Operations, 10User-Addshore: Reset Wikitech 2FA access for Matthias_Geisler_WMDE - https://phabricator.wikimedia.org/T213928 (10Matthias_Geisler_WMDE) 05Open→03Resolved
[15:24:51] <wikibugs>	 10Operations, 10User-Addshore: Reset Wikitech 2FA access for Matthias_Geisler_WMDE - https://phabricator.wikimedia.org/T213928 (10Matthias_Geisler_WMDE) Thanks!!!!!
[15:25:12] <anomie>	 ok, s3 should still get to complete before s1 finishes even with the extra delay.
[15:25:38] <marostegui>	 anomie: I will ping you tomorrow as soon as we are completely done with the failover so you can resume once you get online
[15:26:13] <anomie>	 ok
[15:26:30] <marostegui>	 sorry :(
[15:26:43] <marostegui>	 This came kinda unexpectedly 
[15:26:55] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 40.17 seconds
[15:26:59] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 33.22 seconds
[15:26:59] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 33.28 seconds
[15:27:11] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 1.17 seconds
[15:27:31] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 0.22 seconds
[15:27:45] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[15:27:51] <fdans>	 a-team are we doing today the first analytics deployment train? :)
[15:28:01] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 0.52 seconds
[15:28:02] <fdans>	 the A Train
[15:28:39] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2006 is OK: OK - running: The system is fully operational
[15:29:37] <elukey>	 fdans: let's chat in #analytics to see if there is stuff to deploy first
[15:29:51] <fdans>	 OH SORRY
[15:33:26] <wikibugs>	 10Operations, 10serviceops, 10User-Joe: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Joe) p:05Triage→03Normal
[15:33:46] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484686 (owner: 10Marostegui)
[15:34:14] <wikibugs>	 (03PS1) 10Gehel: wdqs: fix broken logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/484690
[15:34:53] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484686 (owner: 10Marostegui)
[15:35:23] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad.php: Promote db1078 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484614 (https://phabricator.wikimedia.org/T213858)
[15:35:54] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1078 T209815 (duration: 00m 52s)
[15:35:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:57] <stashbot>	 T209815: Upgrade firmware on db1078 - https://phabricator.wikimedia.org/T209815
[15:37:37] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Upgrade firmware on db1078 - https://phabricator.wikimedia.org/T209815 (10Marostegui) 05Open→03Resolved Thank you so much! The server is back in the mix.
[15:38:11] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] wdqs: fix broken logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/484690 (owner: 10Gehel)
[15:39:03] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[15:39:32] <wikibugs>	 (03CR) 10Mathew.onipe: [C: 03+1] wdqs: fix broken logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/484690 (owner: 10Gehel)
[15:40:31] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Depool db1123 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484642 (https://phabricator.wikimedia.org/T213858)
[15:41:01] <jbond42>	 !log "Import new debdeploy 0.0.99.7 packages for stretch T207845
[15:41:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:04] <stashbot>	 T207845: debdeploy: show help message if invoked with no arguments - https://phabricator.wikimedia.org/T207845
[15:43:04] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484686 (owner: 10Marostegui)
[15:44:14] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] Expand Coordinator.resultUp behavior on first monitor check result [debs/pybal] - 10https://gerrit.wikimedia.org/r/478203 (owner: 10Mark Bergsma)
[15:50:01] <wikibugs>	 (03PS2) 10Kosta Harlan: EditorJourney: Enable data collection for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484289 (https://phabricator.wikimedia.org/T213348)
[15:50:21] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db1123 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484642 (https://phabricator.wikimedia.org/T213858) (owner: 10Jcrespo)
[15:51:28] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Depool db1123 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484642 (https://phabricator.wikimedia.org/T213858) (owner: 10Jcrespo)
[15:55:59] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Depool db1123 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484642 (https://phabricator.wikimedia.org/T213858) (owner: 10Jcrespo)
[15:56:08] <jbond42>	 !log Import new debdeploy 0.0.99.7 packages for jessie T207845
[15:56:08] <wikibugs>	 (03PS26) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381)
[15:56:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:11] <stashbot>	 T207845: debdeploy: show help message if invoked with no arguments - https://phabricator.wikimedia.org/T207845
[15:56:11] <wikibugs>	 (03PS1) 10DCausse: [cirrus] Start using replica group settings (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484693 (https://phabricator.wikimedia.org/T210381)
[15:56:25] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] wdqs: fix broken logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/484690 (owner: 10Gehel)
[15:57:45] <logmsgbot>	 !log otto@deploy1001 Started deploy [analytics/superset/deploy@f73b897]: bump to 0.26.3-wikimedia2 with chart format string fix
[15:57:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:21] <logmsgbot>	 !log otto@deploy1001 Finished deploy [analytics/superset/deploy@f73b897]: bump to 0.26.3-wikimedia2 with chart format string fix (duration: 00m 36s)
[15:58:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:55] <wikibugs>	 (03PS1) 10GTirloni: wmcs::nfs::misc - Add nfsd-ldap package back [puppet] - 10https://gerrit.wikimedia.org/r/484694 (https://phabricator.wikimedia.org/T209527)
[15:59:11] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1123 (duration: 00m 52s)
[15:59:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:17] <jbond42>	 !log Import new debdeploy 0.0.99.7 packages for buster T207845
[15:59:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:03] <wikibugs>	 (03CR) 10EBernhardson: [cirrus] Start using replica group settings (take 2) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484693 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[16:02:27] <wikibugs>	 (03CR) 10GTirloni: [C: 03+2] wmcs::nfs::misc - Add nfsd-ldap package back [puppet] - 10https://gerrit.wikimedia.org/r/484694 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni)
[16:02:43] <wikibugs>	 (03PS2) 10GTirloni: wmcs::nfs::misc - Add nfsd-ldap package back [puppet] - 10https://gerrit.wikimedia.org/r/484694 (https://phabricator.wikimedia.org/T209527)
[16:02:55] <jbond42>	 !log Import new debdeploy 0.0.99.7 packages for trusty T207845
[16:02:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:58] <stashbot>	 T207845: debdeploy: show help message if invoked with no arguments - https://phabricator.wikimedia.org/T207845
[16:08:08] <jynus>	 !log upgrade and stop db1123
[16:08:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:20] <wikibugs>	 (03Abandoned) 10Gehel: [WIP] wdqs: create multiple instances of blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483234 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel)
[16:12:06] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "looks good in principle, but can be simplified" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe)
[16:14:03] <wikibugs>	 (03PS1) 10Jcrespo: Revert "mariadb: Depool db1123 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484697
[16:15:01] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "see comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484685 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe)
[16:15:29] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/483798 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe)
[16:21:13] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Comments inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484579 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn)
[16:22:37] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 43.57 seconds
[16:22:37] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 43.64 seconds
[16:22:41] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 29.43 seconds
[16:22:45] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 16.95 seconds
[16:22:53] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 4.51 seconds
[16:22:59] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[16:23:01] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 0.15 seconds
[16:23:11] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 0.39 seconds
[16:23:41] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 0.28 seconds
[16:24:45] <wikibugs>	 10Operations, 10Gerrit: Convert Gerrit to use H2 as the database after 2.16 upgrade - https://phabricator.wikimedia.org/T211139 (10Paladox) Reading notes from https://gitenterprise.me/2019/01/16/migrating-from-gerrit-2-15-to-2-16/  To convert you setup a vanilla gerrit site (weather it be in a separate directo...
[16:25:55] <wikibugs>	 (03CR) 10Mark Bergsma: [C: 03+2] Expand Coordinator.resultUp behavior on first monitor check result [debs/pybal] - 10https://gerrit.wikimedia.org/r/478203 (owner: 10Mark Bergsma)
[16:25:57] <wikibugs>	 (03PS3) 10Smalyshev: Move categories namespace to second instance [puppet] - 10https://gerrit.wikimedia.org/r/484344 (https://phabricator.wikimedia.org/T213212)
[16:26:54] <wikibugs>	 (03Merged) 10jenkins-bot: Expand Coordinator.resultUp behavior on first monitor check result [debs/pybal] - 10https://gerrit.wikimedia.org/r/478203 (owner: 10Mark Bergsma)
[16:35:11] <addshore>	 jouncebot: next
[16:35:11] <jouncebot>	 In 0 hour(s) and 24 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1700)
[16:38:28] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Depool db1123 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484697 (owner: 10Jcrespo)
[16:39:32] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1123 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484697 (owner: 10Jcrespo)
[16:41:46] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: toolforge: refactor docker registry profile [puppet] - 10https://gerrit.wikimedia.org/r/483765 (https://phabricator.wikimedia.org/T213418)
[16:42:00] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Remove externalIP settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/484670
[16:42:51] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.80 seconds
[16:43:06] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: toolforge: refactor docker registry profile [puppet] - 10https://gerrit.wikimedia.org/r/483765 (https://phabricator.wikimedia.org/T213418)
[16:45:35] <vgutierrez>	 !log upgrading NIC firmware on cp1075 - T203194
[16:45:37] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1123 (duration: 00m 52s)
[16:45:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:38] <stashbot>	 T203194: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194
[16:45:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:35] <logmsgbot>	 !log gehel@deploy1001 Started deploy [wdqs/wdqs@6685dc0]: multi instance fixes
[16:47:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:45] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: refactor docker registry profile [puppet] - 10https://gerrit.wikimedia.org/r/483765 (https://phabricator.wikimedia.org/T213418) (owner: 10Arturo Borrero Gonzalez)
[16:48:53] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.79 seconds
[16:49:03] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.40 seconds
[16:49:07] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.55 seconds
[16:49:09] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.26 seconds
[16:49:19] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.11 seconds
[16:49:39] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.84 seconds
[16:49:55] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.70 seconds
[16:50:22] <wikibugs>	 10Operations, 10Core Platform Team, 10Performance-Team, 10serviceops, 10User-Joe: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10akosiaris) Adding performance-team and core platform team per SoS recommendation to request for help.
[16:50:42] <wikibugs>	 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Milimetric) > Which major design goal would that be? /me genuinely interested  The next paragraph in T211...
[16:53:31] <icinga-wm>	 PROBLEM - IPsec on cp4029 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1075_v4, cp1075_v6
[16:53:33] <icinga-wm>	 PROBLEM - IPsec on cp5008 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1075_v4, cp1075_v6
[16:53:35] <icinga-wm>	 PROBLEM - IPsec on cp5012 is CRITICAL: Strongswan CRITICAL - ok: 35 not-conn: cp1075_v6
[16:53:39] <jynus>	 !log stop upgrade and restart db1112
[16:53:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:45] <icinga-wm>	 RECOVERY - IPsec on cp4029 is OK: Strongswan OK - 36 ESP OK
[16:54:45] <icinga-wm>	 RECOVERY - IPsec on cp5008 is OK: Strongswan OK - 36 ESP OK
[16:54:49] <icinga-wm>	 RECOVERY - IPsec on cp5012 is OK: Strongswan OK - 36 ESP OK
[16:56:00] <wikibugs>	 (03CR) 10jenkins-bot: Revert "mariadb: Depool db1123 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484697 (owner: 10Jcrespo)
[16:58:04] <wikibugs>	 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Security-Team, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q3): [2 hrs] Decide on handling system updates for Proton - https://phabricator.wikimedia.org/T213366 (10ovasileva)
[16:58:04] <logmsgbot>	 !log gehel@deploy1001 Finished deploy [wdqs/wdqs@6685dc0]: multi instance fixes (duration: 10m 29s)
[16:58:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1700).
[17:00:04] <jouncebot>	 Zoranzoki21, kostajh, and dcausse: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[17:00:56] <Zoranzoki21>	 Hey
[17:00:59] <Zoranzoki21>	 Here?
[17:01:01] <dcausse>	 o/
[17:01:17] <logmsgbot>	 !log gehel@deploy1001 Started deploy [wdqs/wdqs@6685dc0]: multi instance fixes
[17:01:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:45] <logmsgbot>	 !log gehel@deploy1001 Finished deploy [wdqs/wdqs@6685dc0]: multi instance fixes (duration: 00m 27s)
[17:01:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:53] <kostajh>	 Here
[17:02:07] <Zoranzoki21>	 Who will SWAT
[17:02:52] <Zoranzoki21>	 (I will move my patches for next)
[17:03:05] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) Great meeting!  We realized that the same problem we are encountering with the available Cloud VPS infrastructu...
[17:03:15] <dcausse>	 I can SWAT
[17:04:07] <dcausse>	 he's gone :/
[17:04:10] <dcausse>	 kostajh: around?
[17:04:36] <kostajh>	 dcausse: I'm here
[17:05:32] <vgutierrez>	 !log upgrading NIC firmware in cp1076 - T203194
[17:05:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:35] <stashbot>	 T203194: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194
[17:06:13] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484289 (https://phabricator.wikimedia.org/T213348) (owner: 10Kosta Harlan)
[17:06:30] <wikibugs>	 (03CR) 10Mobrovac: services: add missing 'mediawiki/services' prefix to git cloning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn)
[17:07:15] <logmsgbot>	 !log anomie@deploy1001 Synchronized php-1.33.0-wmf.12/includes/page/WikiPage.php: Add temporary logging for T210739 (duration: 00m 53s)
[17:07:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:20] <stashbot>	 T210739: Target deletion during page move fails - https://phabricator.wikimedia.org/T210739
[17:07:23] <wikibugs>	 (03Merged) 10jenkins-bot: EditorJourney: Enable data collection for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484289 (https://phabricator.wikimedia.org/T213348) (owner: 10Kosta Harlan)
[17:08:34] <dcausse>	 kostajh: it's live on mwdebug1002, can you test there?
[17:08:48] <wikibugs>	 10Operations, 10Core Platform Team, 10Performance-Team, 10serviceops, 10User-Joe: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Reedy) Do we want MW to tag edits etc like we did for HHVM?
[17:08:58] <kostajh>	 dcausse: yes, just a few minutes please
[17:09:03] <dcausse>	 sure
[17:09:56] <wikibugs>	 (03CR) 10jenkins-bot: EditorJourney: Enable data collection for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484289 (https://phabricator.wikimedia.org/T213348) (owner: 10Kosta Harlan)
[17:10:39] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] elasticsearch_cluster: change is_green() implementation [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 (owner: 10Mathew.onipe)
[17:10:41] <wikibugs>	 10Operations, 10Core Platform Team, 10Performance-Team, 10serviceops, 10User-Joe: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Joe) >>! In T213934#4885135, @Reedy wrote: > Do we want MW to tag edits etc like we did for HHVM?  I would think so, yes.
[17:11:41] <wikibugs>	 (03CR) 10jenkins-bot: elasticsearch_cluster: change is_green() implementation [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 (owner: 10Mathew.onipe)
[17:12:15] <kostajh>	 dcausse: my account creation request timed out, so trying again (needed in order to verify)
[17:12:37] <dcausse>	 kostajh: oh yes mwdebug1002 is a bit annoying :/
[17:12:51] <akosiaris>	 ?
[17:12:55] <akosiaris>	 how come?
[17:12:57] <kostajh>	 got it to work on second try, waiting for someone else on my team to verify
[17:13:07] <akosiaris>	 it is lacking resources? we can add more if so
[17:13:07] <icinga-wm>	 PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6
[17:13:07] <icinga-wm>	 PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6
[17:13:07] <icinga-wm>	 PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6
[17:13:07] <icinga-wm>	 PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6
[17:13:11] <icinga-wm>	 PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6
[17:13:17] <icinga-wm>	 PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6
[17:13:17] <icinga-wm>	 PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6
[17:13:21] <icinga-wm>	 PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[17:13:23] <icinga-wm>	 PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6
[17:13:25] <icinga-wm>	 PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6
[17:13:25] <icinga-wm>	 PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6
[17:13:27] <dcausse>	 akosiaris: after scap it seems to struggle a lot
[17:13:27] <icinga-wm>	 PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6
[17:13:27] <icinga-wm>	 PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6
[17:13:27] <icinga-wm>	 PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6
[17:13:27] <icinga-wm>	 PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6
[17:13:27] <icinga-wm>	 PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6
[17:13:28] <icinga-wm>	 PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6
[17:13:28] <icinga-wm>	 PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6
[17:13:35] <icinga-wm>	 PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[17:13:35] <icinga-wm>	 PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[17:13:37] <icinga-wm>	 PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[17:13:39] <icinga-wm>	 PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[17:13:41] <icinga-wm>	 PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6
[17:13:49] <icinga-wm>	 PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6
[17:13:51] <icinga-wm>	 PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 63 not-conn: cp1076_v6
[17:13:53] <icinga-wm>	 PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 39 not-conn: cp1076_v6
[17:13:53] <icinga-wm>	 PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 39 not-conn: cp1076_v6
[17:13:53] <icinga-wm>	 PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 39 not-conn: cp1076_v6
[17:13:53] <icinga-wm>	 PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 39 not-conn: cp1076_v6
[17:13:55] <icinga-wm>	 PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 63 not-conn: cp1076_v6
[17:14:19] <akosiaris>	 dcausse: ah yes it's evident from https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?orgId=1&var-server=mwdebug1002&var-datasource=eqiad%20prometheus%2Fops&from=now-1h&to=now-1m
[17:14:22] <mutante>	 ^ eh, SAL says the firmware was upgraded just recently
[17:14:23] <icinga-wm>	 RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 40 ESP OK
[17:14:23] <icinga-wm>	 RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 40 ESP OK
[17:14:23] <icinga-wm>	 RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 40 ESP OK
[17:14:23] <icinga-wm>	 RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 40 ESP OK
[17:14:25] <mutante>	 vgutierrez: ^
[17:14:27] <icinga-wm>	 RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 40 ESP OK
[17:14:33] <icinga-wm>	 RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 40 ESP OK
[17:14:33] <icinga-wm>	 RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 40 ESP OK
[17:14:37] <icinga-wm>	 RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK
[17:14:39] <icinga-wm>	 RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 40 ESP OK
[17:14:41] <icinga-wm>	 RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 40 ESP OK
[17:14:41] <icinga-wm>	 RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 40 ESP OK
[17:14:43] <icinga-wm>	 RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 40 ESP OK
[17:14:43] <icinga-wm>	 RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 40 ESP OK
[17:14:43] <icinga-wm>	 RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 40 ESP OK
[17:14:43] <icinga-wm>	 RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 40 ESP OK
[17:14:43] <icinga-wm>	 RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 40 ESP OK
[17:14:43] <icinga-wm>	 RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 40 ESP OK
[17:14:44] <icinga-wm>	 RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 40 ESP OK
[17:14:49] <icinga-wm>	 RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK
[17:14:51] <icinga-wm>	 RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 64 ESP OK
[17:14:51] <icinga-wm>	 RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK
[17:14:53] <icinga-wm>	 RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 64 ESP OK
[17:14:55] <icinga-wm>	 RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 40 ESP OK
[17:14:56] <wikibugs>	 10Operations, 10Core Platform Team, 10Performance-Team, 10serviceops, 10User-Joe: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Jdforrester-WMF) Happy to help with this still, per IRC. :-)
[17:15:03] <icinga-wm>	 RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK
[17:15:05] <icinga-wm>	 RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 64 ESP OK
[17:15:09] <icinga-wm>	 RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 40 ESP OK
[17:15:09] <icinga-wm>	 RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 40 ESP OK
[17:15:09] <icinga-wm>	 RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 40 ESP OK
[17:15:09] <icinga-wm>	 RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 40 ESP OK
[17:15:09] <icinga-wm>	 RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 64 ESP OK
[17:15:24] <kostajh>	 dcausse: all looks good
[17:15:26] <akosiaris>	 dcausse: let me know when swat is done. I can double (or even triple) the vCPUs, but I 'll need a reboot
[17:15:50] <dcausse>	 akosiaris: sure, thanks for looking into it, I'll let you know
[17:15:58] <dcausse>	 kostajh: ok deploying
[17:16:41] <kostajh>	 dcausse: thank you!
[17:18:16] <logmsgbot>	 !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EditorJourney: Enable data collection for viwiki T213348 (duration: 00m 52s)
[17:18:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:19] <stashbot>	 T213348: Understanding first day: activate for Vietnamese Wikipedia - https://phabricator.wikimedia.org/T213348
[17:18:40] <dcausse>	 kostajh: yw, (and it's live btw)
[17:19:21] <_joe_>	 yeah we need to give that machine a bit more cpu kostajh
[17:19:24] <_joe_>	 there is a ticket already
[17:19:30] <_joe_>	 sorry I never got to it
[17:19:45] <wikibugs>	 (03PS2) 10DCausse: [cirrus] Start using replica group settings (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484693 (https://phabricator.wikimedia.org/T210381)
[17:19:47] <wikibugs>	 (03PS27) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381)
[17:20:39] <vgutierrez>	 mutante: hmm it looks like the FW upgrade is too slow and the IPsec check is triggered
[17:21:00] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.39 seconds
[17:21:26] <mutante>	 vgutierrez: when i pinged i assumed it broke again.. only then realized you were in the middle of it. gotcha
[17:21:38] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484693 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[17:21:49] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.42 seconds
[17:21:59] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.39 seconds
[17:22:15] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.37 seconds
[17:22:17] <vgutierrez>	 !log rolling NIC firmware upgrade cp[1077-1080] - T203194
[17:22:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:20] <stashbot>	 T203194: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194
[17:22:35] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.32 seconds
[17:22:43] <wikibugs>	 (03Merged) 10jenkins-bot: [cirrus] Start using replica group settings (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484693 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[17:23:08] <wikibugs>	 (03CR) 10jenkins-bot: [cirrus] Start using replica group settings (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484693 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[17:25:59] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 372.67 seconds
[17:27:31] <wikibugs>	 10Operations, 10Core Platform Team, 10Performance-Team, 10serviceops, and 2 others: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Reedy) ^ Most of it done by reverting Ori's patch to remove the HHVM beta feature and then updating to match
[17:29:21] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10jcrespo) switchover script works as expected (tested on db1111/db1112):  `lang=sh, lines=10 ./switchover.py --skip-slave-move db1111 db1112 Starting preflight checks... * Original rea...
[17:29:37] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 434.71 seconds
[17:30:40] <wikibugs>	 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10EvanProdromou) So, we use caching in MediaWiki for a...
[17:31:25] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 465.08 seconds
[17:31:27] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 465.20 seconds
[17:35:13] <logmsgbot>	 !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [cirrus] Start using replica group settings (take 2) (T210381) (duration: 00m 51s)
[17:35:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:16] <stashbot>	 T210381: Update mw-config to use the psi&omega elastic clusters - https://phabricator.wikimedia.org/T210381
[17:35:59] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) Awesome news! We have to include it on the steps list on our etherpad, which I wrote yesterday evening and needs to be reviewed by you, as it was late in the day, so error...
[17:36:53] <logmsgbot>	 !log dcausse@deploy1001 Synchronized wmf-config/CirrusSearch-production.php: [cirrus] Start using replica group settings (take 2) (T210381) (duration: 00m 51s)
[17:36:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:32] <wikibugs>	 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10Joe) >>! In T212129#4885223, @EvanProdromou wrote: >...
[17:39:05] <wikibugs>	 (03PS2) 10Andrew Bogott: proxyleaks.py: update for multi-region and other issues [puppet] - 10https://gerrit.wikimedia.org/r/484303
[17:39:07] <wikibugs>	 (03PS1) 10Andrew Bogott: SGE: move exec-manage script from bastions to grid masters [puppet] - 10https://gerrit.wikimedia.org/r/484713
[17:39:55] <wikibugs>	 (03PS2) 10Effie Mouzeli: role::eqiad::scb: Switch rdb1006 to redis::misc::master [puppet] - 10https://gerrit.wikimedia.org/r/484572 (https://phabricator.wikimedia.org/T213859)
[17:40:04] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] proxyleaks.py: update for multi-region and other issues [puppet] - 10https://gerrit.wikimedia.org/r/484303 (owner: 10Andrew Bogott)
[17:40:28] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] SGE: move exec-manage script from bastions to grid masters [puppet] - 10https://gerrit.wikimedia.org/r/484713 (owner: 10Andrew Bogott)
[17:42:55] <icinga-wm>	 PROBLEM - Backup of s2 in codfw on db1115 is CRITICAL: Backup for s2 at codfw taken more than 8 days ago: Most recent backup 2019-01-08 17:39:30
[17:53:13] <jynus>	 !log stop upgrade and restart db1111
[17:53:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:09] <wikibugs>	 (03PS1) 10Andrew Bogott: sge: second attempt to get exec-manage installed on the master [puppet] - 10https://gerrit.wikimedia.org/r/484715
[17:54:10] <addshore>	 jouncebot: next
[17:54:10] <jouncebot>	 In 0 hour(s) and 5 minute(s): Wikidata WikibaseQualityConstraints Job deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1800)
[17:55:17] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] sge: second attempt to get exec-manage installed on the master [puppet] - 10https://gerrit.wikimedia.org/r/484715 (owner: 10Andrew Bogott)
[17:55:23] <logmsgbot>	 !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@0aa107a]: Re-deploy for fixing vars.sh
[17:55:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:55] <wikibugs>	 (03PS1) 10Addshore: ConstraintsCheckJobs enabled on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484719 (https://phabricator.wikimedia.org/T204031)
[17:59:10] <ebernhardson>	 !next
[18:00:04] <jouncebot>	 addshore: Your horoscope predicts another unfortunate Wikidata WikibaseQualityConstraints Job deployment deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1800).
[18:00:04] <jouncebot>	 addshore: A patch you scheduled for Wikidata WikibaseQualityConstraints Job deployment is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:07] <wikibugs>	 (03CR) 10Addshore: [C: 03+2] ConstraintsCheckJobs enabled on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484719 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[18:01:41] <wikibugs>	 (03Merged) 10jenkins-bot: ConstraintsCheckJobs enabled on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484719 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[18:01:57] <wikibugs>	 (03CR) 10jenkins-bot: ConstraintsCheckJobs enabled on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484719 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[18:02:52] <wikibugs>	 (03PS28) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381)
[18:02:54] <wikibugs>	 (03PS1) 10DCausse: [cirrus] Enable CirrusSearchCrossClusterSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484720 (https://phabricator.wikimedia.org/T210381)
[18:03:11] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: ConstraintsCheckJobs enabled on testwikidatawiki T204031 (duration: 00m 52s)
[18:03:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:14] <stashbot>	 T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031
[18:04:33] <dcausse>	 addshore: can I use a bit of your deployment time when you're done?
[18:04:43] <addshore>	 dcausse: yes
[18:05:40] <mobrovac>	 addshore: was that 100% of testwd?
[18:05:45] <addshore>	 mobrovac: yes
[18:05:47] <mobrovac>	 k
[18:06:02] <addshore>	 but with only the edit rate im trying to create now :D
[18:06:15] <dcausse>	 addshore: thanks, in fact it's not working so I won't deploy anything yet :/
[18:06:25] <addshore>	 dcausse: :(
[18:06:58] <addshore>	 mobrovac: testwikidatawiki should show up on https://grafana.wikimedia.org/d/000000105/job-queue-rate?orgId=1&var-Job=constraintsRunCheck&from=now-15m&to=now right?
[18:07:12] <logmsgbot>	 !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@0aa107a]: Re-deploy for fixing vars.sh (duration: 11m 49s)
[18:07:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:19] <mobrovac>	 addshore: https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus
[18:07:24] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) >>! In T213859#4883927, @elukey wrote: > Ok for me for the analytics nodes, but I'd need a bit of heads up to properly stop them if possible :) >  > The Thursday...
[18:07:38] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH)
[18:08:28] <addshore>	 mobrovac: amazing, so testwikidatawiki is generaitng jobs and they are entering the queue and running? :)
[18:08:32] <addshore>	 if I'm reading that correctly
[18:08:46] <mobrovac>	 lemme double check
[18:10:17] <mobrovac>	 i see constraintsRunCheck and constraintsTableUpdate but not constraintsCheck
[18:10:18] <wikibugs>	 (03PS1) 10Addshore: ConstraintsCheckJobs enabled on testwikidatawiki (1% of edits) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484723 (https://phabricator.wikimedia.org/T204031)
[18:10:26] <addshore>	 mobrovac: its called constraintsRunCheck
[18:10:27] <addshore>	 =]
[18:10:34] <mobrovac>	 ah ok
[18:10:42] <mobrovac>	 then yes, you are reading this correctly addshore :)
[18:10:54] <addshore>	 I'll go ahead with 1% of wikidata edits then, and that is where we will leave it today :)_
[18:10:57] <wikibugs>	 (03CR) 10Addshore: [C: 03+2] ConstraintsCheckJobs enabled on testwikidatawiki (1% of edits) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484723 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[18:11:31] <mobrovac>	 kk addshore, sounds good
[18:12:02] <wikibugs>	 (03Merged) 10jenkins-bot: ConstraintsCheckJobs enabled on testwikidatawiki (1% of edits) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484723 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[18:12:10] <dcausse>	 akosiaris: forgot to tell you that I'm done with SWAT (sorry got distracted with things not working)
[18:12:18] <mobrovac>	 addshore: for tomorrow, both Petr and I are in PST, so it would be good to continue no earlier than 17:00 UTC
[18:12:29] <addshore>	 mobrovac: ack!
[18:12:53] <addshore>	 mobrovac: the plan is just to do one of the increases per day, so tomorrow (if I do it tomorrow) would only be 1% to 5%
[18:13:07] <mobrovac>	 kk sounds good addshore
[18:13:41] <addshore>	 mobrovac: does each queue have some sort of throughput limits?
[18:13:56] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: ConstraintsCheckJobs enabled on wikidatawiki (1% of edits)  T204031 (duration: 00m 51s)
[18:14:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:01] <stashbot>	 T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031
[18:14:03] <addshore>	 I dont really know many of the nitty gritty details of the magical job queue ;)
[18:14:12] <mobrovac>	 yup addshore, we set it not to overwhelm the jobrunners and the DB
[18:14:22] <addshore>	 mobrovac: great, whats the default?
[18:14:36] <wikibugs>	 (03CR) 10jenkins-bot: ConstraintsCheckJobs enabled on testwikidatawiki (1% of edits) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484723 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore)
[18:15:16] <mobrovac>	 addshore: https://github.com/wikimedia/mediawiki-services-change-propagation-jobqueue-deploy/blob/master/scap/vars.yaml#L101
[18:15:17] <wikibugs>	 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10akosiaris) >>! In T211881#4882359, @MSantos wrote: >>>! In T211881#4878426, @akosiaris wrote: >>> @akosia...
[18:15:22] <addshore>	 mobrovac: thanks!
[18:15:25] <mobrovac>	 addshore: 50 concurrent execs
[18:15:33] <mobrovac>	 addshore: what's the volume you are expecting?
[18:15:50] <mobrovac>	 rate actually, rather than volume
[18:16:23] <addshore>	 well, rate of edits is 400-1000 epm, there will be some amount of deduplication there too,
[18:16:44] <addshore>	 !log deploy slot done
[18:16:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:49] <mobrovac>	 i see
[18:17:01] <mobrovac>	 kk,  we'll likely need a special rule for this one then
[18:17:01] <addshore>	 mobrovac: I can do some more maths later today :)
[18:17:15] <addshore>	 mobrovac: the run times of the jobs can also vary greatly 
[18:18:02] <mobrovac>	 addshore: ok, then we'll definitely need a special rule for this job as i foresee some fine-tuning :)
[18:18:25] <addshore>	 mobrovac: yep! I'm looking forward to it
[18:21:01] <icinga-wm>	 ACKNOWLEDGEMENT - Juniper alarms on asw-a-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms Ayounsi https://phabricator.wikimedia.org/T213859 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[18:21:01] <icinga-wm>	 ACKNOWLEDGEMENT - Juniper alarms on asw2-a-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms Ayounsi https://phabricator.wikimedia.org/T213859 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[18:24:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "True but there's a catch. We won't have another meeting to approve such requests for another 2 weeks. I 'll escalate to mark and faidon fo" [puppet] - 10https://gerrit.wikimedia.org/r/483800 (https://phabricator.wikimedia.org/T213563) (owner: 10Thcipriani)
[18:33:33] <wikibugs>	 (03PS8) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366)
[18:33:44] <wikibugs>	 (03CR) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn)
[18:48:06] <wikibugs>	 (03CR) 10Mobrovac: [C: 03+1] "PCC - https://puppet-compiler.wmflabs.org/compiler1002/14353/scandium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn)
[18:55:59] <mutante>	 !log upgraded jenkins version for jessie and stretch in apt.wikimedia.org to latest LTS
[18:56:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1900)
[19:01:15] <wikibugs>	 (03CR) 10Faidon Liambotis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/483800 (https://phabricator.wikimedia.org/T213563) (owner: 10Thcipriani)
[19:02:15] <paladox>	 im here (for the gerrit upgrade) though will also be watching a vote in the uk.
[19:04:20] <thcipriani>	 !log starting gerrit upgrade to 2.15.8
[19:04:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:04] <wikibugs>	 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (10ayounsi) Diff for option 3 in eqiad is:  `lang=diff [edit interfaces ae1 unit 1017 family inet] +       filter { +           output private-out4; +       } [edit interface...
[19:09:26] <logmsgbot>	 !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@cec7995]: Gerrit to 2.15.8 on gerrit2001 only
[19:09:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:38] <logmsgbot>	 !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@cec7995]: Gerrit to 2.15.8 on gerrit2001 only (duration: 00m 11s)
[19:09:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:37] <logmsgbot>	 !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@cec7995]: Gerrit to 2.15.8 on cobalt
[19:12:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:47] <logmsgbot>	 !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@cec7995]: Gerrit to 2.15.8 on cobalt (duration: 00m 10s)
[19:12:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:49] <thcipriani>	 !log restarting gerrit on cobalt for 2.15.8 upgrade
[19:13:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:43] <wikibugs>	 (03PS9) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366)
[19:17:57] <AndyRussG>	 addshore Reedy https://phabricator.wikimedia.org/T213915#4885559 https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/484734/
[19:20:43] <AndyRussG>	 I think the patch provided should smooth the upcoming train and remaining deploys this week?
[19:20:52] <AndyRussG>	 marxarelli: hi!! ^
[19:21:06] <AndyRussG>	 thx again 4 ur patience
[19:23:25] <brion>	 anyone getting 503s from gerrit recently? oh wait nevermind that was the deploy i bet :D
[19:23:42] <mutante>	 brion: upgrade happened a few min ago but up 
[19:23:43] <brion>	 \o/
[19:23:47] <brion>	 yep looks good now
[19:23:50] <marxarelli>	 AndyRussG: ok. was that already swat deployed to groups on wmf.12?
[19:23:50] <mutante>	 cool
[19:26:16] <marxarelli>	 AndyRussG: i'm rolling wmf.13 this week which, if i'm reading that task correctly, will incorporate the CN you want deployed
[19:27:30] <AndyRussG>	 marxarelli: mmm noo, lemme explain
[19:27:41] <AndyRussG>	 so CentralNotice is a special snowflake for deploys
[19:28:03] <AndyRussG>	 basically the submodule just points to the head of the wmf_deploy branch, always
[19:28:13] <AndyRussG>	 which we update periodically
[19:28:49] <AndyRussG>	 (this should change, btw, see https://phabricator.wikimedia.org/T136904 )
[19:29:09] <thcipriani>	 !log restarting ci jenkins for upgrade
[19:29:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:21] <AndyRussG>	 So this week we did an update... however the intention was to update for .13, not .12
[19:29:45] <AndyRussG>	 since it also affected .12, I made the above patch, to keep .12 at the currently deployed version of CN
[19:30:30] <AndyRussG>	 I guess it might be more relevant for SWAT deploys then, since the train deploy will only push out .13, come to think of it
[19:30:45] <AndyRussG>	 I dunno if train deploys ever do anything with the old branch
[19:31:24] <AndyRussG>	 Here is the task that I made that patch in response to: https://phabricator.wikimedia.org/T213915
[19:31:40] <wikibugs>	 (03PS6) 10Thcipriani: Beta: Clean puppetmaster cherry-picks [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427)
[19:32:12] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Beta: Clean puppetmaster cherry-picks [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) (owner: 10Thcipriani)
[19:32:37] <AndyRussG>	 Anyway, fwiw, wmf.12 should have CN at 63b5490 (which is what's currently deployed) and wmf.13 should be at 445deea40 (also what's currently deployed to wikis that are on wmf.13)
[19:34:33] <AndyRussG>	 Reedy yt?
[19:35:36] <AndyRussG>	 > You just need to update the git sub module for CN on deploy1001 and sync it :)
[19:35:38] <AndyRussG>	 With or without the patch (https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/484734/) ?
[19:36:08] <AndyRussG>	 Though I have deploy rights it's been ages since I did an actual deploy myself, and in truth it kinda terrifies me
[19:36:39] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 234.44 seconds
[19:43:41] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "merging because it's just a revert of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/410072/" [puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) (owner: 10MarcoAurelio)
[19:43:42] <paladox>	 hi, could someone open https://phabricator.wikimedia.org/T210785 please?
[19:44:01] <mutante>	 paladox: i can
[19:44:07] <paladox>	 thanks!
[19:44:09] <mutante>	 what do you need
[19:44:13] <thcipriani>	 ^ ?
[19:44:15] <mutante>	 just to know if i can ?
[19:44:18] <paladox>	 to change visibility to public.
[19:44:28] <paladox>	 since it was deployed.
[19:44:29] <thcipriani>	 ah, cool, thought we weren't done somehow :)
[19:44:31] <mutante>	 ah
[19:44:33] <paladox>	 heh
[19:45:39] <mutante>	 thcipriani: so you agree to make public?
[19:45:49] <thcipriani>	 +1
[19:46:06] <mutante>	 done
[19:46:38] <paladox>	 thanks :)
[19:46:40] <marxarelli>	 AndyRussG: right, so the "you just need to update the submodule for CN and sync it" is implying a swat deploy
[19:47:14] <wikibugs>	 (03PS4) 10Gehel: Move categories namespace to second instance [puppet] - 10https://gerrit.wikimedia.org/r/484344 (https://phabricator.wikimedia.org/T213212) (owner: 10Smalyshev)
[19:47:53] <marxarelli>	 trains roll the next versioned branch for core out, but for cherry-picks and backports for already deployed versioned branches need to be done via swat
[19:48:12] <gehel>	 !log switching wdqs categories traffic to new second instance, puppet will be disabled during the operation on all wdqs nodes - T213212
[19:48:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:15] <stashbot>	 T213212: Move category namespace to a separate blazegraph instance - https://phabricator.wikimedia.org/T213212
[19:48:25] <wikibugs>	 (03PS10) 10Dzahn: mediawiki: Stop logging each run of purge_abusefilter.pp [puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) (owner: 10MarcoAurelio)
[19:49:30] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] Move categories namespace to second instance [puppet] - 10https://gerrit.wikimedia.org/r/484344 (https://phabricator.wikimedia.org/T213212) (owner: 10Smalyshev)
[19:49:51] <marxarelli>	 AndyRussG: hope that makes sense. if you want help scheduling a swat for that, you can hit someone up from https://wikitech.wikimedia.org/wiki/SWAT_deploys#The_team in -releng
[19:52:00] <AndyRussG>	 marxarelli: yeee gotcha, thx!!
[19:52:13] <wikibugs>	 (03PS11) 10Dzahn: mediawiki: Stop logging each run of purge_abusefilter.pp [puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) (owner: 10MarcoAurelio)
[19:52:51] <wikibugs>	 (03PS1) 10ArielGlenn: do multistream dumps in parallel and recombine for big wikis [dumps] - 10https://gerrit.wikimedia.org/r/484754 (https://phabricator.wikimedia.org/T213912)
[19:53:08] <marxarelli>	 AndyRussG: np!
[19:55:23] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1010 is CRITICAL: connect to address 10.64.32.63 and port 80: Connection refused
[19:55:31] <wikibugs>	 (03PS1) 10Gehel: wdqs: fix typo in nginx config file [puppet] - 10https://gerrit.wikimedia.org/r/484755 (https://phabricator.wikimedia.org/T213212)
[19:55:37] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:55:47] <icinga-wm>	 PROBLEM - WDQS HTTP on wdqs1010 is CRITICAL: connect to address 10.64.32.63 and port 80: Connection refused
[19:56:09] <icinga-wm>	 PROBLEM - WDQS HTTP Port on wdqs1010 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused
[19:56:19] <gehel>	 ^ that's me, it's a test server and the fix is coming up
[19:56:44] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] wdqs: fix typo in nginx config file [puppet] - 10https://gerrit.wikimedia.org/r/484755 (https://phabricator.wikimedia.org/T213212) (owner: 10Gehel)
[19:56:52] <wikibugs>	 (03PS2) 10Gehel: wdqs: fix typo in nginx config file [puppet] - 10https://gerrit.wikimedia.org/r/484755 (https://phabricator.wikimedia.org/T213212)
[19:57:11] <mutante>	 saw it, thanks 
[19:59:01] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1010 is OK: HTTP OK: HTTP/1.1 200 OK - 17141 bytes in 0.001 second response time
[19:59:15] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational
[19:59:25] <icinga-wm>	 RECOVERY - WDQS HTTP on wdqs1010 is OK: HTTP OK: HTTP/1.1 200 OK - 17141 bytes in 0.001 second response time
[19:59:49] <icinga-wm>	 RECOVERY - WDQS HTTP Port on wdqs1010 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.017 second response time
[20:00:03] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10bd808) >>! In T207321#4885112, @Ottomata wrote: > A. Bryan noted that we won't want to use users regular LDAP accounts fo...
[20:00:04] <jouncebot>	 marxarelli: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Americas version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T2000).
[20:05:42] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) >  I was trying to figure out how this will actually look for the end user  It'd be in the JDBC connection, e.g...
[20:12:06] <wikibugs>	 (03CR) 10Dzahn: "eh.. got the duplicate deploy now. File[/srv/deployment/parsoid/deploy/deploy]" [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn)
[20:14:24] <wikibugs>	 (03PS1) 10Dduvall: group1 wikis to 1.33.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484759
[20:14:26] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.33.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484759 (owner: 10Dduvall)
[20:15:47] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 285.58 seconds
[20:16:01] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484759 (owner: 10Dduvall)
[20:19:05] <logmsgbot>	 !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.13
[20:19:57] <logmsgbot>	 !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.13 (duration: 00m 51s)
[20:20:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:19] <wikibugs>	 (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484759 (owner: 10Dduvall)
[20:30:27] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:30:41] <icinga-wm>	 PROBLEM - Apache HTTP on mw1246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:30:51] <icinga-wm>	 PROBLEM - HHVM rendering on mw1246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:32:45] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1246 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.036 second response time
[20:33:01] <icinga-wm>	 RECOVERY - Apache HTTP on mw1246 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.036 second response time
[20:33:11] <icinga-wm>	 RECOVERY - HHVM rendering on mw1246 is OK: HTTP OK: HTTP/1.1 200 OK - 75751 bytes in 0.096 second response time
[20:33:43] <wikibugs>	 10Puppet, 10AbuseFilter, 10Patch-For-Review, 10User-MarcoAurelio: Check if it is safe to disable logging for purge_abusefilter.pp cron job - https://phabricator.wikimedia.org/T213591 (10MarcoAurelio) 05Open→03Resolved a:03MarcoAurelio Nothing left to do here.
[20:33:57] <wikibugs>	 10Puppet, 10AbuseFilter, 10User-MarcoAurelio: Check if it is safe to disable logging for purge_abusefilter.pp cron job - https://phabricator.wikimedia.org/T213591 (10MarcoAurelio)
[20:36:25] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.70 seconds
[20:38:21] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.68 seconds
[20:38:41] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.26 seconds
[20:38:57] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.27 seconds
[20:39:23] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.71 seconds
[20:39:27] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.71 seconds
[20:39:27] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.48 seconds
[20:39:37] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.49 seconds
[20:39:41] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.02 seconds
[20:41:37] <wikibugs>	 (03CR) 10DCausse: [C: 04-2] "https://github.com/elastic/elasticsearch/issues/26833" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484720 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[20:57:11] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 373 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[20:59:09] <Hauskatze>	 thanks for the merge mutante 
[21:00:05] <jouncebot>	 cscott, arlolra, subbu, bearND, halfak, and Amir1: (Dis)respected human, time to deploy Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T2100). Please do the needful.
[21:02:23] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 22 probes of 373 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[21:05:42] <wikibugs>	 (03PS1) 10Smalyshev: Remove $port handling - it's replaced with $endpoint now [puppet] - 10https://gerrit.wikimedia.org/r/484764
[21:05:57] <wikibugs>	 (03PS1) 10Bstorm: Revert "SGE: move exec-manage script from bastions to grid masters" [puppet] - 10https://gerrit.wikimedia.org/r/484765
[21:06:34] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "SGE: move exec-manage script from bastions to grid masters" [puppet] - 10https://gerrit.wikimedia.org/r/484765 (owner: 10Bstorm)
[21:08:00] <wikibugs>	 (03PS1) 10Bstorm: toolforge: change exec-manage to not try submissions and set up on masters [puppet] - 10https://gerrit.wikimedia.org/r/484769 (https://phabricator.wikimedia.org/T213951)
[21:09:10] <wikibugs>	 (03PS2) 10Bstorm: Revert "SGE: move exec-manage script from bastions to grid masters" [puppet] - 10https://gerrit.wikimedia.org/r/484765
[21:09:50] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] toolforge: change exec-manage to not try submissions and set up on masters [puppet] - 10https://gerrit.wikimedia.org/r/484769 (https://phabricator.wikimedia.org/T213951) (owner: 10Bstorm)
[21:10:05] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] Revert "SGE: move exec-manage script from bastions to grid masters" [puppet] - 10https://gerrit.wikimedia.org/r/484765 (owner: 10Bstorm)
[21:10:48] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] Revert "SGE: move exec-manage script from bastions to grid masters" [puppet] - 10https://gerrit.wikimedia.org/r/484765 (owner: 10Bstorm)
[21:10:56] <logmsgbot>	 !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@6b344ca]: Update mobileapps to 258d76b page summary changes
[21:10:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:00] <wikibugs>	 (03PS3) 10Bstorm: Revert "SGE: move exec-manage script from bastions to grid masters" [puppet] - 10https://gerrit.wikimedia.org/r/484765
[21:12:48] <wikibugs>	 10Operations: Explicitly mention npm in L3 - https://phabricator.wikimedia.org/T213971 (10Legoktm)
[21:13:03] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] Remove $port handling - it's replaced with $endpoint now [puppet] - 10https://gerrit.wikimedia.org/r/484764 (owner: 10Smalyshev)
[21:13:12] <wikibugs>	 (03PS2) 10Gehel: Remove $port handling - it's replaced with $endpoint now [puppet] - 10https://gerrit.wikimedia.org/r/484764 (owner: 10Smalyshev)
[21:14:09] <wikibugs>	 (03PS2) 10Bstorm: toolforge: change exec-manage to not try submissions and set up on masters [puppet] - 10https://gerrit.wikimedia.org/r/484769 (https://phabricator.wikimedia.org/T213951)
[21:15:40] <wikibugs>	 (03PS5) 10Ottomata: Add kafka-single-node chart for local development [deployment-charts] - 10https://gerrit.wikimedia.org/r/484498 (https://phabricator.wikimedia.org/T211247)
[21:15:49] <wikibugs>	 (03PS3) 10Bstorm: toolforge: change exec-manage to not try submissions and set up on masters [puppet] - 10https://gerrit.wikimedia.org/r/484769 (https://phabricator.wikimedia.org/T213951)
[21:17:08] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] toolforge: change exec-manage to not try submissions and set up on masters [puppet] - 10https://gerrit.wikimedia.org/r/484769 (https://phabricator.wikimedia.org/T213951) (owner: 10Bstorm)
[21:17:26] <logmsgbot>	 !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@6b344ca]: Update mobileapps to 258d76b page summary changes (duration: 06m 31s)
[21:17:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:44] <wikibugs>	 (03PS6) 10Ottomata: Add kafka-single-node chart for local development [deployment-charts] - 10https://gerrit.wikimedia.org/r/484498 (https://phabricator.wikimedia.org/T211247)
[21:20:12] <wikibugs>	 (03PS1) 10RobH: adding in R440 single CPU SKU [software] - 10https://gerrit.wikimedia.org/r/484771
[21:22:41] <logmsgbot>	 !log bmansurov@deploy1001 Started deploy [recommendation-api/deploy@da83637]: Update to 1a1f824
[21:22:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] adding in R440 single CPU SKU [software] - 10https://gerrit.wikimedia.org/r/484771 (owner: 10RobH)
[21:26:29] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10bd808) >>! In T207321#4885829, @Ottomata wrote: >> I was trying to figure out how this will actually look for the end use...
[21:26:46] <wikibugs>	 (03PS3) 10Cwhite: mediawiki: enable statsd_exporter and add matching rules to appserver [puppet] - 10https://gerrit.wikimedia.org/r/481110 (https://phabricator.wikimedia.org/T205870)
[21:28:55] <logmsgbot>	 !log bmansurov@deploy1001 Finished deploy [recommendation-api/deploy@da83637]: Update to 1a1f824 (duration: 06m 14s)
[21:28:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:13] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 500 (expecting: 200)
[21:29:13] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [recommendation-api/deploy@da83637]: log
[21:29:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:15] <logmsgbot>	 !log ppchelko@deploy1001 deploy aborted: log (duration: 00m 02s)
[21:29:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:21] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 46.45 seconds
[21:29:29] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 500 (expecting: 200)
[21:29:31] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 23.35 seconds
[21:29:35] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 500 (expecting: 200)
[21:29:41] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 1.21 seconds
[21:29:55] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 0.49 seconds
[21:30:07] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[21:30:21] <jijiki>	  Pchelolo ^ is that you (the scb* pages)?
[21:30:23] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 0.34 seconds
[21:30:29] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[21:30:39] <Pchelolo>	 jijiki: that's me
[21:30:41] <Pchelolo>	 sorry
[21:30:47] <jijiki>	 hehe
[21:30:54] <Pchelolo>	 it's not a problem, it's not really exposed to anyone yet
[21:33:33] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 500 (expecting: 200)
[21:33:35] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [recommendation-api/deploy@c1b6b32]: Rollback update to 1a1f824
[21:33:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:36] <wikibugs>	 (03CR) 10Cwhite: "Puppet compiler link per volans' request.  10 resources added." [puppet] - 10https://gerrit.wikimedia.org/r/481110 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite)
[21:34:17] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Product-Analytics: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (10Neil_P._Quinn_WMF)
[21:34:23] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[21:34:27] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy
[21:34:45] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy
[21:35:05] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) > That works for a java developer, but nobody else (including Quarry). Quarry is a Python app which is why I wa...
[21:35:21] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy
[21:35:34] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [recommendation-api/deploy@c1b6b32]: Rollback update to 1a1f824 (duration: 01m 59s)
[21:35:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:57] <wikibugs>	 10Operations, 10Legalpad: Explicitly mention npm in L3 - https://phabricator.wikimedia.org/T213971 (10Aklapper)
[21:38:49] <wikibugs>	 (03PS2) 10ArielGlenn: do multistream dumps in parallel and recombine for big wikis [dumps] - 10https://gerrit.wikimedia.org/r/484754 (https://phabricator.wikimedia.org/T213912)
[21:48:40] <wikibugs>	 (03PS1) 10Andrew Bogott: cold-migrate: use the new nova_eqiad1 database [puppet] - 10https://gerrit.wikimedia.org/r/484774
[21:49:04] <wikibugs>	 (03PS2) 10Andrew Bogott: cold-migrate: use the new nova_eqiad1 database [puppet] - 10https://gerrit.wikimedia.org/r/484774
[21:49:43] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cold-migrate: use the new nova_eqiad1 database [puppet] - 10https://gerrit.wikimedia.org/r/484774 (owner: 10Andrew Bogott)
[21:52:37] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 0.38 seconds
[21:54:22] <wikibugs>	 10Operations, 10Legalpad: Explicitly mention npm in L3 - https://phabricator.wikimedia.org/T213971 (10CDanis) p:05Triage→03Normal
[21:55:45] <icinga-wm>	 RECOVERY - High lag on wdqs2006 is OK: (C)3600 ge (W)1200 ge 1059 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[21:58:13] <wikibugs>	 (03PS7) 10Ottomata: Add kafka-single-node chart for local development [deployment-charts] - 10https://gerrit.wikimedia.org/r/484498 (https://phabricator.wikimedia.org/T211247)
[22:06:33] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review, 10Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249 (10awight) We should probably decline this task in favor of {T182331}.
[22:09:04] <wikibugs>	 (03CR) 10CRusnov: [C: 03+1] "Just a slight followup here. Overall LGTM." (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/481914 (owner: 10Volans)
[22:13:24] <wikibugs>	 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10ayounsi)
[22:15:53] <wikibugs>	 (03PS12) 10Ottomata: [WIP] Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247)
[22:43:25] <wikibugs>	 (03PS2) 10Dzahn: xhgui: Remove outdated clone of xhprof mirror [puppet] - 10https://gerrit.wikimedia.org/r/484351 (https://phabricator.wikimedia.org/T196406) (owner: 10Krinkle)
[22:44:16] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] xhgui: Remove outdated clone of xhprof mirror [puppet] - 10https://gerrit.wikimedia.org/r/484351 (https://phabricator.wikimedia.org/T196406) (owner: 10Krinkle)
[22:44:58] <icinga-wm>	 PROBLEM - Backup of s4 in eqiad on db1115 is CRITICAL: Backup for s4 at eqiad taken more than 8 days ago: Most recent backup 2019-01-08 22:35:52
[22:47:00] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "do you want me to also rm /srv/xhprof/profiles ?" [puppet] - 10https://gerrit.wikimedia.org/r/484351 (https://phabricator.wikimedia.org/T196406) (owner: 10Krinkle)
[22:50:56] <wikibugs>	 (03CR) 10Krinkle: "yes please, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/484351 (https://phabricator.wikimedia.org/T196406) (owner: 10Krinkle)
[22:51:08] <wikibugs>	 (03CR) 10Krinkle: "/srv/xhprof/ entirely actually" [puppet] - 10https://gerrit.wikimedia.org/r/484351 (https://phabricator.wikimedia.org/T196406) (owner: 10Krinkle)
[22:52:32] <wikibugs>	 (03PS1) 10Jforrester: WBMI: Disable showing 'depicts' statements on Commons for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484781
[22:52:34] <wikibugs>	 (03PS1) 10Jforrester: [Beta Cluster] WBMI: Show 'depicts' statements on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484782
[22:53:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WBMI: Disable showing 'depicts' statements on Commons for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484781 (owner: 10Jforrester)
[22:53:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [Beta Cluster] WBMI: Show 'depicts' statements on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484782 (owner: 10Jforrester)
[23:04:55] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2087 is OK: OK slave_sql_lag Replication lag: 52.55 seconds
[23:05:03] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2053 is OK: OK slave_sql_lag Replication lag: 52.51 seconds
[23:05:09] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2039 is OK: OK slave_sql_lag Replication lag: 50.52 seconds
[23:05:11] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2095 is OK: OK slave_sql_lag Replication lag: 48.47 seconds
[23:05:13] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2067 is OK: OK slave_sql_lag Replication lag: 47.74 seconds
[23:05:17] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2060 is OK: OK slave_sql_lag Replication lag: 48.41 seconds
[23:05:23] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2089 is OK: OK slave_sql_lag Replication lag: 44.13 seconds
[23:05:41] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2046 is OK: OK slave_sql_lag Replication lag: 40.74 seconds
[23:05:41] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2076 is OK: OK slave_sql_lag Replication lag: 36.18 seconds
[23:08:09] <wikibugs>	 (03PS1) 10RobH: setting test to use python2 [software] - 10https://gerrit.wikimedia.org/r/484783
[23:09:03] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] setting test to use python2 [software] - 10https://gerrit.wikimedia.org/r/484783 (owner: 10RobH)
[23:09:15] <wikibugs>	 (03CR) 10Alex Monk: [C: 03+2] Release 0.8 [software/certcentral] - 10https://gerrit.wikimedia.org/r/484511 (https://phabricator.wikimedia.org/T209980) (owner: 10Vgutierrez)
[23:10:37] <Krinkle>	 !log krinkle@tungsten:/srv/: rm -rf xhprof; for T196406
[23:10:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:43] <stashbot>	 T196406: Decom "xhprof" viewer  - https://phabricator.wikimedia.org/T196406
[23:10:56] <wikibugs>	 (03Merged) 10jenkins-bot: Release 0.8 [software/certcentral] - 10https://gerrit.wikimedia.org/r/484511 (https://phabricator.wikimedia.org/T209980) (owner: 10Vgutierrez)
[23:11:17] <wikibugs>	 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10Krinkle)
[23:12:31] <wikibugs>	 (03CR) 10jenkins-bot: Release 0.8 [software/certcentral] - 10https://gerrit.wikimedia.org/r/484511 (https://phabricator.wikimedia.org/T209980) (owner: 10Vgutierrez)
[23:13:24] <wikibugs>	 (03CR) 10RobH: [V: 03+2 C: 03+2] setting test to use python2 [software] - 10https://gerrit.wikimedia.org/r/484783 (owner: 10RobH)
[23:13:57] <wikibugs>	 (03PS2) 10RobH: adding in R440 single CPU SKU [software] - 10https://gerrit.wikimedia.org/r/484771
[23:14:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] adding in R440 single CPU SKU [software] - 10https://gerrit.wikimedia.org/r/484771 (owner: 10RobH)
[23:16:20] <wikibugs>	 (03PS1) 10RobH: Revert "setting test to use python2" [software] - 10https://gerrit.wikimedia.org/r/484786
[23:16:26] <wikibugs>	 (03CR) 10RobH: [V: 03+2 C: 03+2] Revert "setting test to use python2" [software] - 10https://gerrit.wikimedia.org/r/484786 (owner: 10RobH)
[23:21:24] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [recommendation-api/deploy@0ff39e2]: Deployment attempt with decreased worker count
[23:21:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:46] <wikibugs>	 (03PS1) 10RobH: trying to fix CI testing [software] - 10https://gerrit.wikimedia.org/r/484787
[23:22:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] trying to fix CI testing [software] - 10https://gerrit.wikimedia.org/r/484787 (owner: 10RobH)
[23:23:25] <wikibugs>	 (03CR) 10RobH: [V: 03+2 C: 03+2] "we expected a v-1 due to python setting we are changing via this very patchset" [software] - 10https://gerrit.wikimedia.org/r/484787 (owner: 10RobH)
[23:25:04] <wikibugs>	 (03CR) 10Paladox: "see https://github.com/wikimedia/certcentral/blob/8f316e33511707b6d871ad8e15868dfd77e32ff7/tox.ini" (034 comments) [software] - 10https://gerrit.wikimedia.org/r/484787 (owner: 10RobH)
[23:25:31] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [recommendation-api/deploy@0ff39e2]: Deployment attempt with decreased worker count (duration: 04m 08s)
[23:25:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:20] <wikibugs>	 (03PS3) 10RobH: adding in R440 single CPU SKU [software] - 10https://gerrit.wikimedia.org/r/484771
[23:28:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] adding in R440 single CPU SKU [software] - 10https://gerrit.wikimedia.org/r/484771 (owner: 10RobH)
[23:28:44] <wikibugs>	 (03PS1) 10Paladox: test: do not merge [software] - 10https://gerrit.wikimedia.org/r/484790
[23:28:59] <wikibugs>	 (03PS2) 10Paladox: test: do not merge [software] - 10https://gerrit.wikimedia.org/r/484790
[23:29:45] <wikibugs>	 (03PS1) 10RobH: Revert "trying to fix CI testing" [software] - 10https://gerrit.wikimedia.org/r/484791
[23:29:47] <wikibugs>	 (03PS3) 10Paladox: test: do not merge [software] - 10https://gerrit.wikimedia.org/r/484790
[23:29:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] test: do not merge [software] - 10https://gerrit.wikimedia.org/r/484790 (owner: 10Paladox)
[23:29:54] <wikibugs>	 (03CR) 10RobH: [V: 03+2 C: 03+2] Revert "trying to fix CI testing" [software] - 10https://gerrit.wikimedia.org/r/484791 (owner: 10RobH)
[23:30:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] test: do not merge [software] - 10https://gerrit.wikimedia.org/r/484790 (owner: 10Paladox)
[23:30:45] <wikibugs>	 (03PS4) 10Paladox: test: do not merge [software] - 10https://gerrit.wikimedia.org/r/484790
[23:30:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] test: do not merge [software] - 10https://gerrit.wikimedia.org/r/484790 (owner: 10Paladox)
[23:31:11] <wikibugs>	 (03Abandoned) 10Paladox: test: do not merge [software] - 10https://gerrit.wikimedia.org/r/484790 (owner: 10Paladox)
[23:31:27] <wikibugs>	 (03PS1) 10Paladox: test do not merge [software] - 10https://gerrit.wikimedia.org/r/484792
[23:31:37] <wikibugs>	 (03PS2) 10Paladox: test do not merge [software] - 10https://gerrit.wikimedia.org/r/484792
[23:32:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] test do not merge [software] - 10https://gerrit.wikimedia.org/r/484792 (owner: 10Paladox)
[23:32:58] <wikibugs>	 10Operations, 10netops, 10Patch-For-Review: Detect IP address collisions - https://phabricator.wikimedia.org/T189522 (10ayounsi) a:05ayounsi→03faidon Note that I also created an LibreNMS alert to monitor explicitly the mgmt network: `%syslog.msg ~ "KERN_ARP_ADDR_CHANGE" && %devices.hostname ~ "mr" && %de...
[23:33:49] <wikibugs>	 (03Abandoned) 10Paladox: test do not merge [software] - 10https://gerrit.wikimedia.org/r/484792 (owner: 10Paladox)
[23:37:21] <wikibugs>	 10Operations, 10netops, 10Patch-For-Review: IGMP snooping breaks IPv6 ND on Junos 14.1X53-D46 - https://phabricator.wikimedia.org/T201039 (10ayounsi) p:05High→03Normal
[23:38:03] <wikibugs>	 (03PS10) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366)
[23:42:20] <wikibugs>	 (03PS3) 10Dzahn: Add Blubber directory to releases server [puppet] - 10https://gerrit.wikimedia.org/r/483800 (https://phabricator.wikimedia.org/T213563) (owner: 10Thcipriani)
[23:42:24] <wikibugs>	 (03PS1) 10Cwhite: role: add prometheus2 rules (new format) [puppet] - 10https://gerrit.wikimedia.org/r/484793 (https://phabricator.wikimedia.org/T213708)
[23:43:08] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Add Blubber directory to releases server [puppet] - 10https://gerrit.wikimedia.org/r/483800 (https://phabricator.wikimedia.org/T213563) (owner: 10Thcipriani)
[23:44:00] <wikibugs>	 (03PS2) 10Cwhite: role: add prometheus2 rules (new format) [puppet] - 10https://gerrit.wikimedia.org/r/484793 (https://phabricator.wikimedia.org/T213708)
[23:45:11] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "deployed, releases1001 and releases2001 have the new dir, rsync and users" [puppet] - 10https://gerrit.wikimedia.org/r/483800 (https://phabricator.wikimedia.org/T213563) (owner: 10Thcipriani)
[23:46:27] <wikibugs>	 10Operations, 10netops, 10Patch-For-Review: IGMP snooping breaks IPv6 ND on Junos 14.1X53-D46 - https://phabricator.wikimedia.org/T201039 (10ayounsi) Reducing priority as the situation is stable. At this point I don't think the cost of upgrading the switch stacks of row B and C (full row down for ~15min) is...
[23:50:13] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2075 is OK: OK slave_sql_lag Replication lag: 57.18 seconds
[23:50:25] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 54.26 seconds
[23:50:29] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2066 is OK: OK slave_sql_lag Replication lag: 54.56 seconds
[23:50:35] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2094 is OK: OK slave_sql_lag Replication lag: 53.65 seconds
[23:50:37] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2084 is OK: OK slave_sql_lag Replication lag: 52.12 seconds
[23:50:41] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2089 is OK: OK slave_sql_lag Replication lag: 51.08 seconds
[23:50:51] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 48.91 seconds
[23:53:06] <wikibugs>	 (03PS4) 10Huji: Add new synonyms for namespaces in Persian (fa) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484256 (https://phabricator.wikimedia.org/T213733)
[23:53:08] <wikibugs>	 10Operations, 10Traffic, 10netops: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (10ayounsi)
[23:53:26] <wikibugs>	 (03PS4) 10Huji: Dissallow eliminators to block certain groups on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476503 (https://phabricator.wikimedia.org/T210642)
[23:53:32] <wikibugs>	 10Operations, 10media-storage: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 (10Platonides) >>! In T213655#4880685, @jcrespo wrote: > Could you try to restore it @Platonides using the wiki admin tools before trying some SQL?  There is no entry to restore on the wiki.  No link stating th...
[23:53:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Dissallow eliminators to block certain groups on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476503 (https://phabricator.wikimedia.org/T210642) (owner: 10Huji)
[23:54:44] <AndyRussG>	 Hi! Is anyone SWATting? addshore hashar aude MaxSem twentyafterfour RoanKattouw Dereckson thcipriani Niharika zeljkof
[23:55:25] <AndyRussG>	 I don't have any patches specifically to deploy, but there is one to clean up CN submodule prior to deploys to wmf.12
[23:55:48] <RoanKattouw>	 The SWAT window doesn't start for another 5 minutes, and also it's empty
[23:55:55] <AndyRussG>	 RoanKattouw: yeah I saw
[23:56:06] <RoanKattouw>	 So there's still time to make it not empty ;)
[23:56:24] <AndyRussG>	 RoanKattouw: can you take maybe a peek at this to make sure it's sane? https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/484734/
[23:56:28] <wikibugs>	 10Operations, 10Traffic, 10netops: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (10ayounsi)
[23:56:30] <wikibugs>	 10Operations, 10netops: esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 - https://phabricator.wikimedia.org/T207753 (10ayounsi)
[23:57:09] <AndyRussG>	 as per https://phabricator.wikimedia.org/T213915
[23:58:32] <Reedy>	 AndyRussG: Checked it properly, I can confirm it's a revert of https://github.com/wikimedia/mediawiki/commit/251c06deb12a57e0c6891d047fa045697495c5ee as expected :)
[23:59:01] <AndyRussG>	 Reedy: ok cool thanks!!! Maybe now's a good time as any to merge it in?
[23:59:19] <AndyRussG>	 or I guess clean out the deploy server?
[23:59:20] <Reedy>	 Might aswell
[23:59:29] <Reedy>	 It's basically creating a noop at this point
[23:59:33] <AndyRussG>	 yeah
[23:59:35] <icinga-wm>	 PROBLEM - Backup of s6 in eqiad on db1115 is CRITICAL: Backup for s6 at eqiad taken more than 8 days ago: Most recent backup 2019-01-08 23:34:19
[23:59:39] <RoanKattouw>	 Yes I just independently concluded it's an undo of that bit merge commit
[23:59:52] <wikibugs>	 (03CR) 10Mobrovac: services: add missing 'mediawiki/services' prefix to git cloning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn)
[23:59:55] <RoanKattouw>	 I don't have the context to understand what's going on, but if Reedy says it makes sense then I believe him
[23:59:56] <AndyRussG>	 RoanKattouw: yeah the big merge commit was meant just for wmf.13