[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Evening SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T0000). [00:00:04] No GERRIT patches in the queue for this window AFAICS. [00:01:10] (03PS1) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 [00:01:45] mobrovac: if we can say that this defined type is only for mediawiki/services.. then this way i would say? ^ [00:02:10] it almost feels like a line similar to that existed at some point and got lost [00:05:28] (03PS2) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 [00:08:00] (03PS3) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) [00:10:09] (03PS7) 10MarcoAurelio: mediawiki: Stop logging each run of purge_abusefilter.pp [puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) [00:10:41] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:11:53] (03CR) 10Mobrovac: services: add missing 'mediawiki/services' prefix to git cloning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [00:12:05] (03CR) 10MarcoAurelio: "Thanks for checking the logs. If there's nothing wrong (ie: "ERROR" or similar; I don't remember the exact output) and ops are content wit" [puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) (owner: 10MarcoAurelio) [00:12:47] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:12:47] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:12:47] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:12:59] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:13:13] (03Abandoned) 10MarcoAurelio: Disable NewUserMessage on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482679 (owner: 10MarcoAurelio) [00:14:13] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:14:13] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:14:17] PROBLEM - MariaDB Slave SQL: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:14:21] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:14:23] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:14:26] (03PS8) 10Dzahn: mediawiki: Stop logging each run of purge_abusefilter.pp [puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) (owner: 10MarcoAurelio) [00:14:27] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:14:27] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:14:29] PROBLEM - MariaDB Slave IO: 
x1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:14:31] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:14:45] PROBLEM - MariaDB Slave IO: s8 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:14:47] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:14:47] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:14:49] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:14:49] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:14:57] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:15:03] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:15:07] hm [00:15:09] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:15:09] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:15:13] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:15:13] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:15:15] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:15:15] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:15:20] aaaaa hi [00:15:24] lol [00:15:31] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:15:42] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@a04ebdd]: Restart RESTBase to pick up the fact that restbase1016 is not there - T212418 (duration: 21m 34s) [00:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:46] T212418: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 [00:15:46] heh, TBH I have mentions here muted, just thought that'd be funny [00:16:45] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 3.07 seconds [00:16:59] (03CR) 10Mobrovac: [C: 03+1] visualdiff: ensure git clone happens before creating pngs dir [puppet] - 10https://gerrit.wikimedia.org/r/484342 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [00:18:10] (03PS2) 10Dzahn: visualdiff: ensure git clone happens before creating pngs dir [puppet] - 10https://gerrit.wikimedia.org/r/484342 (https://phabricator.wikimedia.org/T201366) [00:19:14] (03PS2) 10Dbarratt: Enable Partial Blocks on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476884 (https://phabricator.wikimedia.org/T210444) (owner: 10Dmaza) [00:19:45] RECOVERY - MariaDB Slave Lag: s5 on db2089 is OK: OK slave_sql_lag Replication lag: 20.30 seconds [00:19:55] RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 9.11 seconds [00:20:01] RECOVERY - MariaDB Slave Lag: s5 on db2094 is OK: OK slave_sql_lag Replication lag: 0.02 seconds [00:20:11] RECOVERY - MariaDB Slave Lag: s5 on db2066 is OK: OK slave_sql_lag Replication lag: 0.21 seconds [00:20:21] RECOVERY - MariaDB Slave Lag: s5 on db2075 is OK: OK slave_sql_lag Replication lag: 0.41 seconds [00:20:31] RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK 
slave_sql_lag Replication lag: 0.00 seconds [00:20:35] RECOVERY - MariaDB Slave Lag: s5 on db2059 is OK: OK slave_sql_lag Replication lag: 0.49 seconds [00:20:37] RECOVERY - MariaDB Slave Lag: s5 on db2084 is OK: OK slave_sql_lag Replication lag: 0.33 seconds [00:20:59] (03CR) 10Dzahn: [C: 03+2] visualdiff: ensure git clone happens before creating pngs dir [puppet] - 10https://gerrit.wikimedia.org/r/484342 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [00:21:47] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:22:18] 10Operations, 10RESTBase-Cassandra, 10Services (next): restbase cassandra driver excessive logging when cassandra hosts are down - https://phabricator.wikimedia.org/T212424 (10Pchelolo) A bit of background regarding the current appearance: - When restbase1016 failed, it started logging with an extremelly ha... [00:23:33] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:23:47] PROBLEM - MariaDB Slave Lag: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:23:47] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:23:55] (03CR) 10Dzahn: [C: 03+2] "worked on scandium:" [puppet] - 10https://gerrit.wikimedia.org/r/484342 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [00:24:49] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 0.51 seconds [00:25:15] re: dbstore1002 (also) https://phabricator.wikimedia.org/T206965 [00:27:13] PROBLEM - HP RAID on ms-be1034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.223: Connection reset by peer [00:34:43] (03PS4) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) [00:34:56] (03CR) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [00:35:11] (03CR) 10jerkins-bot: [V: 04-1] services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [00:36:53] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [00:37:39] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.83 seconds [00:38:53] (03PS5) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) [00:40:28] (03PS6) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) [00:42:55] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.78 seconds [00:42:55] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.30 seconds [00:42:55] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.34 seconds [00:43:11] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.50 seconds [00:43:11] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL 
slave_sql_lag Replication lag: 305.49 seconds [00:43:23] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.21 seconds [00:43:47] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.52 seconds [00:45:23] (03PS2) 10Thcipriani: Add Blubber directory to releases server [puppet] - 10https://gerrit.wikimedia.org/r/483800 (https://phabricator.wikimedia.org/T213563) [00:46:04] (03CR) 10Thcipriani: "> One minor omission. The profile needs to be included in the" [puppet] - 10https://gerrit.wikimedia.org/r/483800 (https://phabricator.wikimedia.org/T213563) (owner: 10Thcipriani) [00:50:39] RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave [00:50:39] RECOVERY - MariaDB Slave IO: m3 on dbstore1002 is OK: OK slave_io_state not a slave [00:50:39] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag not a slave [00:50:51] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2363.94 seconds [00:50:51] RECOVERY - MariaDB Slave Lag: m2 on dbstore1002 is OK: OK slave_sql_lag not a slave [00:50:59] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave [00:51:03] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3314.28 seconds [00:51:03] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3322.28 seconds [00:51:03] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3281.29 seconds [00:51:17] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state not a slave [00:51:17] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2751.14 seconds [00:51:17] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3331.14 seconds [00:51:17] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3146.15 seconds [00:56:45] ACKNOWLEDGEMENT - High lag on wdqs2005 is CRITICAL: 8639 ge 3600 Stas Malychev db reload, will catch up soon https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [00:57:27] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.68 seconds [00:57:41] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.99 seconds [00:57:46] (03PS9) 10Ottomata: [WIP] Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) [00:57:51] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.12 seconds [00:58:13] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.13 seconds [00:58:17] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.48 seconds [00:58:25] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.03 seconds [00:58:29] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.72 seconds [00:58:39] PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.51 seconds [00:58:45] PROBLEM - MariaDB Slave Lag: s7 on 
db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.08 seconds [00:59:04] 10Operations, 10RESTBase-Cassandra, 10Services (next): restbase cassandra driver excessive logging when cassandra hosts are down - https://phabricator.wikimedia.org/T212424 (10mobrovac) >>! In T212424#4883122, @Pchelolo wrote: > As I understand, the driver does not recognize a node being marked as DOWN by Ca... [01:02:15] (03CR) 10Mobrovac: services: add missing 'mediawiki/services' prefix to git cloning (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [01:02:43] !log repooled wdqs200[45] for now, 2006 still not done, will get to it later today [01:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:54] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/14350/releases1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/483800 (https://phabricator.wikimedia.org/T213563) (owner: 10Thcipriani) [01:16:06] (03PS7) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) [01:17:30] (03CR) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [01:35:25] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:41:03] PROBLEM - MariaDB Slave Lag: s6 on db2039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.00 seconds [01:41:07] PROBLEM - MariaDB Slave Lag: s6 on db2053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.27 seconds [01:41:15] PROBLEM - MariaDB Slave Lag: s6 on db2046 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.27 seconds [01:41:15] PROBLEM - MariaDB Slave Lag: s6 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.80 seconds [01:41:37] PROBLEM - MariaDB Slave Lag: s6 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.45 seconds [01:41:49] PROBLEM - MariaDB Slave Lag: s6 on db2060 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.27 seconds [01:42:03] PROBLEM - MariaDB Slave Lag: s6 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.33 seconds [01:42:07] PROBLEM - MariaDB Slave Lag: s6 on db2076 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 316.61 seconds [01:42:17] PROBLEM - MariaDB Slave Lag: s6 on db2067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.73 seconds [01:52:17] RECOVERY - High lag on wdqs2004 is OK: (C)3600 ge (W)1200 ge 1094 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [01:53:12] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10bd808) **cloudservices1004** is the hot-spare in the cloudservices100[34] cluster supporting the eqiad1-r region of our OpenStack deploy. It should be fine to perform... 
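The change discussed at the top of this window (gerrit 484602, "services: add missing 'mediawiki/services' prefix to git cloning") concerns a Puppet defined type that is only ever used for repositories under the mediawiki/services/ hierarchy in Gerrit, so the prefix can be added once inside the define instead of by every caller. A minimal sketch of that idea, assuming a simplified define and the generic git::clone resource; the names and parameters below are illustrative, not the actual module code:

    # Illustrative sketch only: callers pass the short service name
    # ("parsoid", "mathoid", ...) and the define prepends the Gerrit
    # prefix, since it is only meant for mediawiki/services/* repos.
    define service_repo_checkout (
        String $directory,
        String $branch = 'master',
    ) {
        git::clone { "mediawiki/services/${title}":
            directory => $directory,
            branch    => $branch,
        }
    }
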
[01:54:09] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10bd808) [01:57:12] (03CR) 10Catrope: [C: 03+1] EditorJourney: Enable data collection for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484289 (https://phabricator.wikimedia.org/T213348) (owner: 10Kosta Harlan) [01:58:25] RECOVERY - High lag on wdqs2005 is OK: (C)3600 ge (W)1200 ge 976 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [02:29:45] PROBLEM - HP RAID on ms-be1034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.223: Connection reset by peer [02:34:23] (03PS1) 10BryanDavis: toolforge: kube2proxy: validate requests library version [puppet] - 10https://gerrit.wikimedia.org/r/484609 (https://phabricator.wikimedia.org/T213711) [02:34:51] (03CR) 10jerkins-bot: [V: 04-1] toolforge: kube2proxy: validate requests library version [puppet] - 10https://gerrit.wikimedia.org/r/484609 (https://phabricator.wikimedia.org/T213711) (owner: 10BryanDavis) [02:37:09] PROBLEM - SSH on ms-be1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:38:13] RECOVERY - SSH on ms-be1034 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) [02:40:03] (03PS2) 10BryanDavis: toolforge: kube2proxy: validate requests library version [puppet] - 10https://gerrit.wikimedia.org/r/484609 (https://phabricator.wikimedia.org/T213711) [02:47:35] (03PS2) 10Mathew.onipe: elasticsearch_cluster: fix doc for is_green [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 [02:53:42] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch_cluster: fix doc for is_green [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 (owner: 10Mathew.onipe) [03:07:47] PROBLEM - SSH on ms-be1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:10:45] PROBLEM - HP RAID on ms-be1034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.223: Connection reset by peer [03:11:25] RECOVERY - SSH on ms-be1034 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) [03:19:14] (03PS3) 10Mathew.onipe: elasticsearch_cluster: fix doc for is_green [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 [03:19:29] (03CR) 10Mathew.onipe: elasticsearch_cluster: fix doc for is_green (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 (owner: 10Mathew.onipe) [03:22:11] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10CDanis) It's fine to simply shut down `prometheus1003`. We have a redundant machine `prometheus1004` which will continue gathering metrics and answering queries. `pr... 
[03:26:19] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.06 seconds [03:26:33] PROBLEM - MariaDB Slave Lag: s4 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.91 seconds [03:26:43] PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.42 seconds [03:26:59] PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 316.46 seconds [03:27:01] PROBLEM - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 316.41 seconds [03:27:13] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.56 seconds [03:27:21] PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 322.85 seconds [03:27:25] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 325.56 seconds [03:29:01] RECOVERY - MariaDB Slave Lag: s6 on db2076 is OK: OK slave_sql_lag Replication lag: 52.11 seconds [03:29:17] RECOVERY - MariaDB Slave Lag: s6 on db2039 is OK: OK slave_sql_lag Replication lag: 5.49 seconds [03:29:17] RECOVERY - MariaDB Slave Lag: s6 on db2067 is OK: OK slave_sql_lag Replication lag: 2.58 seconds [03:29:21] RECOVERY - MariaDB Slave Lag: s6 on db2053 is OK: OK slave_sql_lag Replication lag: 0.44 seconds [03:29:23] RECOVERY - MariaDB Slave Lag: s6 on db2089 is OK: OK slave_sql_lag Replication lag: 0.13 seconds [03:29:29] RECOVERY - MariaDB Slave Lag: s6 on db2046 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [03:29:43] RECOVERY - MariaDB Slave Lag: s6 on db2095 is OK: OK slave_sql_lag Replication lag: 0.33 seconds [03:30:01] RECOVERY - MariaDB Slave Lag: s6 on db2060 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [03:30:09] RECOVERY - MariaDB Slave Lag: s6 on db2087 is OK: OK slave_sql_lag Replication lag: 0.17 seconds [03:42:24] (03PS4) 10BryanDavis: toolforge: process dynamicproxy access logs [puppet] - 10https://gerrit.wikimedia.org/r/482237 (https://phabricator.wikimedia.org/T87001) [03:52:35] (03CR) 10BryanDavis: "> LGTM! Has this been tested in toolforge already via cherry-pick?" 
[puppet] - 10https://gerrit.wikimedia.org/r/482237 (https://phabricator.wikimedia.org/T87001) (owner: 10BryanDavis) [04:05:01] RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 58.33 seconds [04:05:05] RECOVERY - MariaDB Slave Lag: s4 on db2091 is OK: OK slave_sql_lag Replication lag: 52.31 seconds [04:05:19] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 45.46 seconds [04:05:25] RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 39.25 seconds [04:05:29] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 37.50 seconds [04:05:37] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 28.57 seconds [04:05:49] RECOVERY - MariaDB Slave Lag: s4 on db2095 is OK: OK slave_sql_lag Replication lag: 20.89 seconds [04:06:01] RECOVERY - MariaDB Slave Lag: s4 on db2084 is OK: OK slave_sql_lag Replication lag: 12.57 seconds [04:12:15] (03PS5) 10BryanDavis: cloud: rewrite spreadcheck.py NPRE check [puppet] - 10https://gerrit.wikimedia.org/r/483606 [04:16:07] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.39 seconds [04:19:47] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [04:21:45] (03CR) 10BryanDavis: "> The puppet changes seem incomplete but maybe I'm missing something" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483606 (owner: 10BryanDavis) [04:24:09] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.53 seconds [04:24:09] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.57 seconds [04:24:15] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.62 seconds [04:24:37] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.26 seconds [04:25:05] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.85 seconds [04:25:11] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.37 seconds [04:25:11] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.44 seconds [04:38:31] PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.45 seconds [04:39:41] PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.11 seconds [04:40:07] PROBLEM - MariaDB Slave Lag: s5 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.68 seconds [04:40:09] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.37 seconds [04:40:15] PROBLEM - MariaDB Slave Lag: s5 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.16 seconds [04:40:19] PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.61 seconds [04:40:31] PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.62 seconds [04:40:37] PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.98 seconds [04:40:55] PROBLEM - MariaDB Slave Lag: s5 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.28 seconds [05:22:17] PROBLEM - Check systemd state on ms-be1024 is CRITICAL: CRITICAL - 
degraded: The system is operational but one or more units failed. [05:28:57] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.69 seconds [05:32:05] RECOVERY - Check systemd state on ms-be1024 is OK: OK - running: The system is fully operational [05:42:37] RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 58.56 seconds [05:42:47] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 56.87 seconds [05:43:11] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 51.56 seconds [05:43:17] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 50.61 seconds [05:43:25] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 47.71 seconds [05:43:27] RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 47.44 seconds [05:43:31] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 47.54 seconds [05:43:41] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 47.48 seconds [05:47:30] 10Operations, 10CommRel-Specialists-Support (Jan-Mar-2019), 10Goal, 10User-Johan: Community Relations support for the 2018 data center switchover - https://phabricator.wikimedia.org/T199676 (10Johan) [05:47:33] 10Operations, 10CommRel-Specialists-Support (Jan-Mar-2019), 10User-Johan: Lessons learned: Communicating the server switch 2018 - https://phabricator.wikimedia.org/T206649 (10Johan) 05Open→03Resolved https://office.wikimedia.org/wiki/Community_Relations_Specialists/codfw/2018_lessons [05:48:03] 10Operations, 10CommRel-Specialists-Support (Jan-Mar-2019), 10Goal, 10User-Johan: Community Relations support for the 2018 data center switchover - https://phabricator.wikimedia.org/T199676 (10Johan) 05Open→03Resolved [05:56:19] RECOVERY - MariaDB Slave Lag: s5 on db2089 is OK: OK slave_sql_lag Replication lag: 0.32 seconds [05:56:25] RECOVERY - MariaDB Slave Lag: s5 on db2094 is OK: OK slave_sql_lag Replication lag: 0.29 seconds [05:56:33] RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 0.51 seconds [05:56:47] RECOVERY - MariaDB Slave Lag: s5 on db2066 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [05:56:49] RECOVERY - MariaDB Slave Lag: s5 on db2075 is OK: OK slave_sql_lag Replication lag: 0.18 seconds [05:57:09] RECOVERY - MariaDB Slave Lag: s5 on db2084 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [05:57:09] RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK slave_sql_lag Replication lag: 0.30 seconds [05:57:15] RECOVERY - MariaDB Slave Lag: s5 on db2059 is OK: OK slave_sql_lag Replication lag: 0.03 seconds [05:57:33] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [06:01:40] !log depooling wdq2005 and wdqs2006 for T213854 [06:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:43] T213854: Reload database on wdq2[456] from another server - https://phabricator.wikimedia.org/T213854 [06:04:41] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [06:04:47] RECOVERY - MariaDB Slave SQL: s8 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:04:49] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:04:49] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state 
Slave_SQL_Running: Yes [06:04:55] PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:04:57] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [06:04:57] RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [06:04:59] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [06:04:59] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:05:11] RECOVERY - MariaDB Slave IO: s8 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [06:05:15] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [06:05:17] RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [06:05:27] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [06:05:27] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:05:35] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:05:39] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:05:39] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:05:39] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table arwiki.echo_notification: Cant find record in echo_notification, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000312, end_log_pos 765943209 [06:05:39] RECOVERY - MariaDB Slave IO: s6 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [06:06:11] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war [06:06:15] PROBLEM - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:06:25] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2006 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused [06:06:46] !log Deploy schema change on db1067 (s1 primary master) - T85757 [06:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:48] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [06:07:08] !log started transfer wdqs2005->2006 [06:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:20] 10Operations, 10ops-eqiad, 10Patch-For-Review: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 (10Marostegui) >>! In T213422#4882368, @jcrespo wrote: > es1019 is back up and mgmt is working. Not starting mysql though, until chris confirms everthin... 
[06:10:12] (03PS1) 10Marostegui: wmnet: Re-point m3-master to dbproxy1008 [dns] - 10https://gerrit.wikimedia.org/r/484611 (https://phabricator.wikimedia.org/T213865) [06:11:57] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 47.33 seconds [06:16:26] 10Operations, 10DBA: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) Don't worry, as soon as we arrange a date/time, I will stop it, so we are sure that no lag will happen before the failover. I will leave the screen running and just kill the process so you can... [06:23:23] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2006 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war [06:23:35] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 [06:27:45] 10Operations, 10Cloud-VPS, 10Traffic, 10serviceops: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (10Kelson) I'm not sure to fully understand the technical explanation. Is the problem confirmed? If "yes", what is the plan to sol... [06:28:29] 10Operations, 10ops-eqiad, 10DBA: Upgrade firmware on db1078 - https://phabricator.wikimedia.org/T209815 (10Marostegui) Chris, can we do this today as this host will be the future s3 primary master? [06:29:11] PROBLEM - puppet last run on an-worker1084 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/update-library.R] [06:30:41] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ImageMagick-6/policy.xml] [06:31:21] PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [06:31:55] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mwrepl] [06:33:21] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
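The m3-master repoint above (gerrit 484611, "wmnet: Re-point m3-master to dbproxy1008", T213865) is a DNS-level failover: the service name that m3 clients (Phabricator's database, per the later chatter) resolve is moved from one proxy host to another. A minimal sketch of what such a change typically looks like in a zone template, assuming m3-master is a CNAME; record name, TTL and file layout are illustrative:

    ; templates/wmnet (illustrative excerpt)
    -m3-master    5M  IN CNAME  dbproxy1003.eqiad.wmnet.
    +m3-master    5M  IN CNAME  dbproxy1008.eqiad.wmnet.

Since only the CNAME target changes, clients move to the new proxy as cached lookups expire, which is why the switch is announced here and then simply watched for side effects.
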
Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean] [06:36:03] RECOVERY - HP RAID on ms-be1034 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [06:37:33] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:43:41] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table wikishared.echo_unread_wikis: Duplicate entry 29911752-enwiki for key echo_unread_wikis_user_wiki, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1069-bin.000312, end_log_pos 795338681 [06:44:53] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:46:29] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 30376.97 seconds [06:47:13] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 71061.26 seconds [06:49:31] PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.37 seconds [06:49:47] PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.12 seconds [06:49:51] PROBLEM - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.50 seconds [06:50:05] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.67 seconds [06:50:09] PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.16 seconds [06:50:15] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.19 seconds [06:50:25] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 317.27 seconds [06:50:29] PROBLEM - MariaDB Slave Lag: s4 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.13 seconds [06:54:39] (03PS1) 10Marostegui: mariadb: Promote db1078 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/484612 (https://phabricator.wikimedia.org/T213858) [06:56:45] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:23] RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:57:59] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:58:00] (03CR) 10Tulsi Bhagat: "Requires `namespaceDupes.php --wiki=zhwikiversity --fix` after deployment." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/482261 (https://phabricator.wikimedia.org/T212919) (owner: 10Wangql) [06:58:11] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.20 seconds [06:58:47] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [06:59:16] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/484612 (https://phabricator.wikimedia.org/T213858) (owner: 10Marostegui) [06:59:18] (03PS1) 10Marostegui: db-eqiad.php: Put s3 on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484613 (https://phabricator.wikimedia.org/T213858) [06:59:27] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:59:28] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484613 (https://phabricator.wikimedia.org/T213858) (owner: 10Marostegui) [07:00:25] RECOVERY - puppet last run on an-worker1084 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:03:32] (03PS1) 10Marostegui: db-eqiad.php: Promote db1078 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484614 (https://phabricator.wikimedia.org/T213858) [07:05:58] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484614 (https://phabricator.wikimedia.org/T213858) (owner: 10Marostegui) [07:14:27] !log Upgrade MySQL on db2050 and db2036 [07:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:13] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10Marostegui) [07:27:02] !log Drop table tag_summary from s2 - T212255 [07:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:05] T212255: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 [07:36:13] !log powercycling cp1088 - T203194 [07:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:17] T203194: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 [07:39:02] (03CR) 10Hashar: [C: 03+2] "That is a good one :)" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/484587 (owner: 10Thcipriani) [07:39:04] (03CR) 10Hashar: [V: 03+2 C: 03+2] Fix deploy_artifacts.py [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/484587 (owner: 10Thcipriani) [07:41:37] RECOVERY - Host cp1088 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [07:41:39] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 64 ESP OK [07:41:41] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 64 ESP OK [07:41:45] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 40 ESP OK [07:41:45] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 40 ESP OK [07:41:45] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 40 ESP OK [07:41:47] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 64 ESP OK [07:41:49] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK [07:41:51] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 40 ESP OK [07:41:53] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 40 ESP OK [07:41:53] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 40 ESP OK [07:41:53] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 40 ESP OK [07:41:53] RECOVERY - IPsec on cp3036 is OK: 
Strongswan OK - 40 ESP OK [07:41:53] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 40 ESP OK [07:41:55] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 40 ESP OK [07:41:55] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 40 ESP OK [07:41:55] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 40 ESP OK [07:42:07] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 40 ESP OK [07:42:13] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 40 ESP OK [07:42:15] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 64 ESP OK [07:42:15] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK [07:42:15] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 64 ESP OK [07:42:15] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 64 ESP OK [07:42:17] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 40 ESP OK [07:42:17] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK [07:42:22] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 40 ESP OK [07:42:25] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 40 ESP OK [07:42:29] RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 40 ESP OK [07:42:33] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK [07:42:37] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 64 ESP OK [07:42:37] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 40 ESP OK [07:42:39] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 40 ESP OK [07:42:39] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 40 ESP OK [07:42:41] RECOVERY - IPsec on cp5006 is OK: Strongswan OK - 40 ESP OK [07:42:43] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 64 ESP OK [07:42:45] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 40 ESP OK [07:42:45] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 40 ESP OK [07:47:11] (03PS7) 10Wangql: Modifying configuration about Chinese Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482261 (https://phabricator.wikimedia.org/T212919) [07:51:06] (03CR) 10Wangql: [C: 03+1] "> Patch Set 7: Verified+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482261 (https://phabricator.wikimedia.org/T212919) (owner: 10Wangql) [07:53:17] ACKNOWLEDGEMENT - IPMI Sensor Status on elastic1032 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] Giuseppe Lavagetto eqiad rack A3 pdu failure. [07:53:17] ACKNOWLEDGEMENT - IPMI Sensor Status on elastic1033 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] Giuseppe Lavagetto eqiad rack A3 pdu failure. [07:53:17] ACKNOWLEDGEMENT - IPMI Sensor Status on elastic1034 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] Giuseppe Lavagetto eqiad rack A3 pdu failure. [07:53:17] ACKNOWLEDGEMENT - IPMI Sensor Status on elastic1035 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] Giuseppe Lavagetto eqiad rack A3 pdu failure. [07:53:17] ACKNOWLEDGEMENT - IPMI Sensor Status on ganeti1007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Giuseppe Lavagetto eqiad rack A3 pdu failure. [07:53:18] ACKNOWLEDGEMENT - IPMI Sensor Status on graphite1003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 1 = Critical, Power Supplies = Critical] Giuseppe Lavagetto eqiad rack A3 pdu failure. 
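The s3 failover preparation earlier in this log (gerrit 484613 and 484614, both parked with "Wait for the failover day") pairs two mediawiki-config changes: first put the section read-only, then swap the master in the section load groups. A minimal sketch of the shape of those edits in db-eqiad.php, assuming the LBFactoryMulti-style configuration used there; hostnames come from the log, while the weights and message text are illustrative:

    // Illustrative sketch of the two prepared changes, not the actual patches.
    // Step 1 (484613): refuse writes on s3 for the duration of the switch.
    'readOnlyBySection' => [
        's3' => 'Emergency maintenance, in read-only mode. Writes will be restored shortly.',
    ],

    // Step 2 (484614): promote db1078; the first entry, with weight 0, is the master.
    'sectionLoads' => [
        's3' => [
            'db1078' => 0,    // new master (replacing db1075)
            // ... remaining s3 replicas with their read weights ...
        ],
    ],

Keeping the read-only flip and the master swap in separate patches lets each be merged and synced independently during the failover window.
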
[07:53:18] ACKNOWLEDGEMENT - IPMI Sensor Status on kubernetes1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Giuseppe Lavagetto eqiad rack A3 pdu failure. [07:54:44] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10Vgutierrez) cp1088 has been affected as well after the kernel upgrade [08:01:57] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 120.76 seconds [08:07:52] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10MoritzMuehlenhoff) [08:11:52] !log drop unneeded tables from the staging db on dbstore1002 according to T212493#4883535 [08:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:56] T212493: Clean up staging db - https://phabricator.wikimedia.org/T212493 [08:15:19] !log Upgrade MySQL on db2043 (s3 codfw master) [08:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:44] !log depool codfw zotero for helm release cleanups [08:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:05] !log convert aria tables to innodb on dbstore1002 - T213706 [08:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:08] T213706: Convert Aria tables to InnoDB on dbstore1002 - https://phabricator.wikimedia.org/T213706 [08:20:45] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: decom radium - https://phabricator.wikimedia.org/T203861 (10MoritzMuehlenhoff) [08:21:43] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10MoritzMuehlenhoff) [08:22:07] (03CR) 10星耀晨曦: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482261 (https://phabricator.wikimedia.org/T212919) (owner: 10Wangql) [08:23:55] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table ruwiki.echo_notification: Cant find record in echo_notification, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000312, end_log_pos 967590890 [08:24:31] !log Drop table tag_summary from s4 - T212255 [08:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:34] T212255: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 [08:25:21] !log akosiaris@deploy1001 scap-helm zotero install -f zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw] [08:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:23] !log akosiaris@deploy1001 scap-helm zotero cluster codfw completed [08:25:23] !log akosiaris@deploy1001 scap-helm zotero finished [08:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:27] (03CR) 10Muehlenhoff: "That's a misunderstanding; I told you that npm is now in stretch-backports, but the internally managed component/package is nodejs." 
[puppet] - 10https://gerrit.wikimedia.org/r/483889 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [08:27:11] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - zotero_1969: Servers kubernetes2002.codfw.wmnet, kubernetes2004.codfw.wmnet are marked down but pooled [08:30:19] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - zotero_1969: Servers kubernetes2002.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled [08:30:34] expected ^ [08:30:42] !log akosiaris@deploy1001 scap-helm zotero install -n production -f zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw] [08:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:44] !log akosiaris@deploy1001 scap-helm zotero cluster codfw completed [08:30:44] !log akosiaris@deploy1001 scap-helm zotero finished [08:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:24] I'll be moving all mw logging to kafka shortly btw [08:32:32] that's https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/483384 [08:37:15] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 976.09 seconds [08:37:39] (03PS2) 10Filippo Giunchedi: Default production logging to new logging infrastructure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483384 (https://phabricator.wikimedia.org/T211124) [08:37:53] (03CR) 10Filippo Giunchedi: [C: 03+2] Default production logging to new logging infrastructure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483384 (https://phabricator.wikimedia.org/T211124) (owner: 10Filippo Giunchedi) [08:38:03] RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 55.03 seconds [08:38:15] RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 54.14 seconds [08:38:15] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 54.18 seconds [08:38:21] RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 53.44 seconds [08:38:27] RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 52.76 seconds [08:38:27] RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 51.98 seconds [08:38:39] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:38:51] RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 47.70 seconds [08:38:59] (03Merged) 10jenkins-bot: Default production logging to new logging infrastructure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483384 (https://phabricator.wikimedia.org/T211124) (owner: 10Filippo Giunchedi) [08:40:16] !log akosiaris@deploy1001 scap-helm zotero upgrade production -f zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw] [08:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:18] !log akosiaris@deploy1001 scap-helm zotero cluster codfw completed [08:40:18] !log akosiaris@deploy1001 scap-helm zotero finished [08:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:12] !log filippo@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Default 
to new logging infrastructure - T211124 (duration: 01m 05s) [08:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:16] T211124: Move mediawiki to new logging infrastructure - https://phabricator.wikimedia.org/T211124 [08:42:23] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table wikishared.echo_unread_wikis: Duplicate entry 56881087-eswiki for key echo_unread_wikis_user_wiki, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1069-bin.000312, end_log_pos 972526823 [08:42:49] duplicate key? [08:43:01] dbstore1002 is screwed [08:43:06] it had a crash the last few days [08:43:11] so don't pay too much attention to it [08:43:17] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [08:43:20] ok [08:43:27] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 259.48 seconds [08:43:37] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:43:51] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [08:44:02] ah ha [08:44:05] 10Operations: Offboard Balazs - https://phabricator.wikimedia.org/T213703 (10MoritzMuehlenhoff) I've removed Balazs from pwstore. [08:44:10] it needs to be replaced asap anyways [08:44:25] 10Operations: Offboard Balazs - https://phabricator.wikimedia.org/T213703 (10MoritzMuehlenhoff) [08:47:55] !log repool zotero in codfw [08:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:33] (03CR) 10Jcrespo: [C: 03+1] wmnet: Re-point m3-master to dbproxy1008 [dns] - 10https://gerrit.wikimedia.org/r/484611 (https://phabricator.wikimedia.org/T213865) (owner: 10Marostegui) [08:51:47] (03CR) 10Marostegui: [C: 03+2] wmnet: Re-point m3-master to dbproxy1008 [dns] - 10https://gerrit.wikimedia.org/r/484611 (https://phabricator.wikimedia.org/T213865) (owner: 10Marostegui) [08:52:55] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.72 seconds [08:53:01] !log depool zotero eqiad for helm release cleanup [08:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:02] !log installing systemd security updates for stretch [08:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:05] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.46 seconds [08:53:31] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.63 seconds [08:53:37] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.48 seconds [08:53:39] PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.47 seconds [08:53:45] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.99 seconds [08:53:47] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.08 seconds [08:53:55] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.07 seconds [08:54:07] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.76 seconds [08:55:40] (03CR) 10jenkins-bot: Default production logging to new logging 
infrastructure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483384 (https://phabricator.wikimedia.org/T211124) (owner: 10Filippo Giunchedi) [08:57:01] !log Re-point m3-master from dbproxy1003 to dbproxy1008 - T213865 [08:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:04] T213865: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 [08:57:38] I have moved m3-master to a different dbproxy, if you notice something strange with phabricator please let me know (T213865) [08:58:06] !log akosiaris@deploy1001 scap-helm zotero install -n production -f zotero-values-eqiad.yaml stable/zotero [namespace: zotero, clusters: eqiad] [08:58:07] !log akosiaris@deploy1001 scap-helm zotero cluster eqiad completed [08:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:07] !log akosiaris@deploy1001 scap-helm zotero finished [08:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:53] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - zotero_1969: Servers kubernetes1002.eqiad.wmnet, kubernetes1003.eqiad.wmnet are marked down but pooled [08:59:01] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - zotero_1969: Servers kubernetes1001.eqiad.wmnet, kubernetes1004.eqiad.wmnet are marked down but pooled [08:59:14] expected ^ should recover in the next 1min or os [08:59:16] so* [09:00:05] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy [09:00:15] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [09:00:32] !log test roll-restart rsyslog on mw hosts in eqiad - T211124 [09:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:34] T211124: Move mediawiki to new logging infrastructure - https://phabricator.wikimedia.org/T211124 [09:00:55] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:03:19] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:06:41] PROBLEM - High lag on wdqs2006 is CRITICAL: 1.101e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [09:06:45] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table wikishared.echo_unread_wikis: Duplicate entry 320104-enwiki for key echo_unread_wikis_user_wiki, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1069-bin.000312, end_log_pos 1000755095 [09:07:48] i think we need to reimport that table [09:08:38] 10Operations, 10DBA, 10Patch-For-Review: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10Marostegui) a:03Marostegui This has been done [09:08:46] is it a drop + replicate from scratch? 
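The x1 thread on dbstore1002 keeps breaking on the echo tables (the HA_ERR_KEY_NOT_FOUND and duplicate-key errors above), so its local copy has drifted from the master and skipping events only postpones the next breakage. A targeted re-import of the affected table, rather than a full drop and re-clone of the host, is one way out; a minimal sketch assuming MariaDB multi-source replication with a named 'x1' connection and a second, healthy x1 replica to dump from (hostnames and paths are illustrative):

    # On both dbstore1002 and the healthy replica: stop the x1 thread at a
    # common master position (e.g. lining them up with START SLAVE ... UNTIL),
    # so the dump corresponds to an exact point in the binlog.
    mysql -e "STOP SLAVE 'x1'; SHOW SLAVE 'x1' STATUS\G"

    # On the healthy replica: dump only the broken table.
    mysqldump --single-transaction wikishared echo_unread_wikis > echo_unread_wikis.sql

    # On dbstore1002: replace the local copy, then resume the x1 thread on both hosts.
    mysql wikishared < echo_unread_wikis.sql
    mysql -e "START SLAVE 'x1'"

The next messages in the log describe exactly this: stopping replication in sync between dbstore1002:x1 and another host, then dumping the specific table.
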
[09:09:23] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10Marostegui) [09:09:35] we have to stop that replication thread in sync with another host, and mysqldump that specific table [09:10:39] !log T210381: elasticsearch search cluster, creating completion suggester indices on psi&omega elastic instances in eqiad&codfw [09:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:42] T210381: Update mw-config to use the psi&omega elastic clusters - https://phabricator.wikimedia.org/T210381 [09:10:46] ah oko makes sense [09:15:19] PROBLEM - configured eth on ms-be1034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.223: Connection reset by peer [09:15:47] PROBLEM - very high load average likely xfs on ms-be1034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.223: Connection reset by peer [09:16:16] !log Stop replication in sync on dbstore1002:x1 and db2034 - T213670 [09:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:19] T213670: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 [09:17:09] !log powercycle ms-be1016 - T213856 [09:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:12] T213856: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T213856 [09:17:35] PROBLEM - Disk space on wdqs2006 is CRITICAL: DISK CRITICAL - free space: /srv 53055 MB (3% inode=99%) [09:17:49] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:18:03] PROBLEM - Host ms-be1016 is DOWN: PING CRITICAL - Packet loss = 100% [09:18:19] PROBLEM - HP RAID on ms-be1034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.223: Connection reset by peer [09:18:55] RECOVERY - configured eth on ms-be1034 is OK: OK - interfaces up [09:20:19] RECOVERY - swift-container-updater on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [09:20:21] RECOVERY - Host ms-be1016 is UP: PING WARNING - Packet loss = 73%, RTA = 1.20 ms [09:20:37] RECOVERY - Check systemd state on ms-be1016 is OK: OK - running: The system is fully operational [09:20:37] RECOVERY - very high load average likely xfs on ms-be1034 is OK: OK - load average: 66.63, 73.16, 68.29 [09:20:45] RECOVERY - Disk space on ms-be1016 is OK: DISK OK [09:20:51] RECOVERY - MD RAID on ms-be1016 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [09:24:57] RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:25:07] ok ms-be1016 is back, raid controller was unhappy, whereas ms-be1034 seems back in line too [09:29:34] !log Stop s3 actor-migration script in order to allow s3 to catch up and to avoid lag during the failover - T188327 T213858 [09:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:38] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327 [09:29:39] T213858: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 [09:31:14] 10Operations, 10DBA, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) @Anomie I have stopped the script as we are most likely going to go ahead with the failover in EU morning (still waiting for the managers to confirm) [09:32:25] RECOVERY - Disk space 
on wdqs2006 is OK: DISK OK [09:33:45] PROBLEM - MariaDB Slave Lag: s6 on db2060 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.90 seconds [09:33:47] PROBLEM - MariaDB Slave Lag: s6 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.40 seconds [09:33:53] PROBLEM - MariaDB Slave Lag: s6 on db2076 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.92 seconds [09:34:09] PROBLEM - MariaDB Slave Lag: s6 on db2039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.33 seconds [09:34:11] PROBLEM - MariaDB Slave Lag: s6 on db2067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.17 seconds [09:34:13] PROBLEM - MariaDB Slave Lag: s6 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.25 seconds [09:34:15] PROBLEM - MariaDB Slave Lag: s6 on db2053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.37 seconds [09:34:21] PROBLEM - WDQS HTTP Port on wdqs2006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time [09:34:25] PROBLEM - MariaDB Slave Lag: s6 on db2046 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.40 seconds [09:34:27] PROBLEM - MariaDB Slave Lag: s6 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.31 seconds [09:34:59] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war [09:35:05] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2006 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused [09:37:59] (03PS7) 10ArielGlenn: write header/body/footer of xml gz files as separate streams [dumps] - 10https://gerrit.wikimedia.org/r/484505 (https://phabricator.wikimedia.org/T182572) [09:38:01] (03PS1) 10Jcrespo: mariadb: Depool db1077 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484620 (https://phabricator.wikimedia.org/T213858) [09:38:03] (03PS1) 10Addshore: BETA wikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484621 (https://phabricator.wikimedia.org/T204031) [09:38:05] (03PS1) 10Addshore: testwikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484622 (https://phabricator.wikimedia.org/T204031) [09:38:07] (03PS1) 10Addshore: testwikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484623 (https://phabricator.wikimedia.org/T204031) [09:38:44] (03PS1) 10Addshore: wikidata: post edit constraint jobs on 1% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484624 (https://phabricator.wikimedia.org/T204031) [09:39:06] (03CR) 10Marostegui: [C: 03+1] mariadb: Depool db1077 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484620 (https://phabricator.wikimedia.org/T213858) (owner: 10Jcrespo) [09:39:15] (03PS1) 10Addshore: wikidata: post edit constraint jobs on 5% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484625 (https://phabricator.wikimedia.org/T204031) [09:39:21] (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db1077 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484620 (https://phabricator.wikimedia.org/T213858) (owner: 10Jcrespo) [09:40:32] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10Marostegui) [09:40:38] (03Merged) 10jenkins-bot: mariadb: Depool db1077 for 
maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484620 (https://phabricator.wikimedia.org/T213858) (owner: 10Jcrespo) [09:41:28] (03PS1) 10Addshore: wikidata: post edit constraint jobs on 10% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484629 (https://phabricator.wikimedia.org/T204031) [09:41:30] (03PS1) 10Addshore: wikidata: post edit constraint jobs on 25% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484630 (https://phabricator.wikimedia.org/T204031) [09:42:12] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10Marostegui) pc1004 can be (and should be) powered off. That host is ready for decommissioning, I have not powered off myself as I am not sure if Chris is wiping disks... [09:42:25] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1077 (duration: 00m 52s) [09:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:31] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10Marostegui) [09:42:36] (03PS2) 10Addshore: wikidata: post edit constraint jobs on 25% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484630 (https://phabricator.wikimedia.org/T204031) [09:43:44] (03PS1) 10Addshore: wikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484633 (https://phabricator.wikimedia.org/T204031) [09:44:49] (03PS1) 10Addshore: wikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484635 (https://phabricator.wikimedia.org/T204031) [09:45:31] (03PS2) 10Addshore: wikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484635 (https://phabricator.wikimedia.org/T204031) [09:46:34] jouncebot: next [09:46:34] In 2 hour(s) and 13 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1200) [09:46:37] jouncebot: now [09:46:37] No deployments scheduled for the next 2 hour(s) and 13 minute(s) [09:47:25] !log upgrade and restart db1077 [09:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:22] jouncebot: refresh [09:48:24] I refreshed my knowledge about deployments. [09:48:26] jouncebot: next [09:48:26] In 1 hour(s) and 11 minute(s): WikibaseQualityConstraints post edits jobs (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1100) [09:48:36] hmm [09:49:14] jouncebot: refresh [09:49:14] (03CR) 10jenkins-bot: mariadb: Depool db1077 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484620 (https://phabricator.wikimedia.org/T213858) (owner: 10Jcrespo) [09:49:15] I refreshed my knowledge about deployments. 
[09:49:17] jouncebot: next [09:49:17] In 0 hour(s) and 10 minute(s): WikibaseQualityConstraints post edits jobs (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1000) [09:52:45] RECOVERY - WDQS HTTP Port on wdqs2006 is OK: HTTP OK: HTTP/1.1 200 OK - 449 bytes in 0.646 second response time [09:52:59] !log upgrade controller firmware on ms-be1016 - T213856 [09:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:01] T213856: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T213856 [09:53:05] (03PS11) 10Mathew.onipe: Elasticsearch failed shard allocation check [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) [09:53:05] PROBLEM - MariaDB Slave IO: s3 on db1124 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1077.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1077.eqiad.wmnet (111 Connection refused) [09:53:13] ^ expected [09:53:23] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2006 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war [09:53:29] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 [09:53:40] (03CR) 10Mathew.onipe: Elasticsearch failed shard allocation check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [09:53:44] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1077 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484640 [09:54:41] (03CR) 10DCausse: [C: 03+1] Elasticsearch failed shard allocation check [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [09:55:22] (03PS1) 10Gehel: This script has been moved to the puppet repository. [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/484641 [09:59:16] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1077 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484640 [09:59:18] (03PS1) 10Jcrespo: mariadb: Depool db1123 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484642 (https://phabricator.wikimedia.org/T213858) [10:00:04] addshore: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for WikibaseQualityConstraints post edits jobs . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1000). [10:00:04] addshore: A patch you scheduled for WikibaseQualityConstraints post edits jobs is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [10:00:41] \o [10:00:47] starting with beta ... 
[10:00:58] 10Operations, 10ops-codfw, 10media-storage: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 (10fgiunchedi) [10:01:13] (03PS2) 10Addshore: BETA wikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484621 (https://phabricator.wikimedia.org/T204031) [10:01:18] (03CR) 10Addshore: [C: 03+2] BETA wikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484621 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:01:27] 10Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T213856 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Looks like the raid controller freaked out, a reboot "fixed" it. I've upgraded the firmware too: https://wikitech.wikimedia.org/wiki/Platfor... [10:01:39] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 775.24 seconds [10:02:23] (03Merged) 10jenkins-bot: BETA wikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484621 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:02:37] (03CR) 10jenkins-bot: BETA wikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484621 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:03:12] (03PS2) 10Addshore: testwikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484622 (https://phabricator.wikimedia.org/T204031) [10:03:54] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: BETA ONLY, [[gerrit:484621]] (duration: 00m 52s) [10:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:05] RECOVERY - MariaDB Slave IO: s3 on db1124 is OK: OK slave_io_state Slave_IO_Running: Yes [10:04:15] (03CR) 10Addshore: [C: 03+2] testwikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484622 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:05:09] (03CR) 10Marostegui: [C: 03+1] mariadb: Depool db1123 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484642 (https://phabricator.wikimedia.org/T213858) (owner: 10Jcrespo) [10:05:51] marostegui: you want to deploy that one? :) [10:06:01] (03Merged) 10jenkins-bot: testwikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484622 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:06:35] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Migrate >=90% of existing Logstash traffic to the logging pipeline - https://phabricator.wikimedia.org/T205851 (10fgiunchedi) [10:06:37] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10Patch-For-Review: Move mediawiki to new logging infrastructure - https://phabricator.wikimedia.org/T211124 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is completed! 
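The constraint-jobs rollout in this window (beta first, then testwikidata, then wikidata at increasing ratios) follows the usual mediawiki-config flow: merge the change in gerrit, update the staging copy on the deployment host, then sync the touched file. A minimal sketch of one such step, assuming the conventional staging path on deploy1001 and an illustrative log message:
    # assumed paths and message text; the config change itself was merged in gerrit first
    cd /srv/mediawiki-staging
    git pull --rebase                       # pick up the merged wmf-config commit
    git log -1 --stat -- wmf-config/InitialiseSettings.php
    scap sync-file wmf-config/InitialiseSettings.php \
        'testwikidatawiki, wgWBQualityConstraintsEnableConstraintsCheckJobsRatio 50 T204031'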
[10:08:19] 10Operations, 10Operations-Software-Development, 10Wikidata, 10Wikidata-Query-Service: Create a cookbook to copy data between WDQS servers - https://phabricator.wikimedia.org/T213401 (10Mathew.onipe) a:03Mathew.onipe [10:08:50] 10Operations, 10Operations-Software-Development, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Create a cookbook to copy data between WDQS servers - https://phabricator.wikimedia.org/T213401 (10Mathew.onipe) [10:13:14] !log addshore@deploy1001 sync-file aborted: testwikidatawiki, wgWBQualityConstraintsEnableConstraintsCheckJobsRatio 50 T204031 [[gerrit:484621]] (duration: 00m 00s) [10:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:17] T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031 [10:13:49] (03PS2) 10Addshore: testwikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484623 (https://phabricator.wikimedia.org/T204031) [10:14:10] (03CR) 10Addshore: [C: 03+2] testwikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484623 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:14:13] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: testwikidatawiki, wgWBQualityConstraintsEnableConstraintsCheckJobsRatio 50 T204031 [[gerrit:484621]] (duration: 00m 52s) [10:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:59] PROBLEM - HHVM jobrunner on mw1337 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [10:15:14] (03Merged) 10jenkins-bot: testwikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484623 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:15:36] (03CR) 10jenkins-bot: testwikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484622 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:15:38] (03CR) 10jenkins-bot: testwikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484623 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:16:11] RECOVERY - HHVM jobrunner on mw1337 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.008 second response time [10:16:47] (03PS8) 10ArielGlenn: write header/body/footer of xml gz files as separate streams [dumps] - 10https://gerrit.wikimedia.org/r/484505 (https://phabricator.wikimedia.org/T182572) [10:17:03] (03CR) 10jerkins-bot: [V: 04-1] write header/body/footer of xml gz files as separate streams [dumps] - 10https://gerrit.wikimedia.org/r/484505 (https://phabricator.wikimedia.org/T182572) (owner: 10ArielGlenn) [10:18:16] 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 3 others: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10fgiunchedi) [10:18:20] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Migrate >=90% of existing Logstash traffic to the logging pipeline - https://phabricator.wikimedia.org/T205851 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is completed with mediawiki logging fully switched to new logging infra: {F27909234} [10:18:41] !log addshore@deploy1001 sync-file aborted: 
testwikidatawiki, wgWBQualityConstraintsEnableConstraintsCheckJobsRatio 100 T204031 [[gerrit:484621]] (duration: 00m 02s) [10:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:44] T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031 [10:18:47] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.06 seconds [10:19:41] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: testwikidatawiki, wgWBQualityConstraintsEnableConstraintsCheckJobsRatio 100 T204031 [[gerrit:484621]] (duration: 00m 52s) [10:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:46] !log executed kafka preferred-replica-election on the logging Kafka cluster as attempt to spread load more uniformly [10:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:29] (03PS2) 10Addshore: wikidata: post edit constraint jobs on 1% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484624 (https://phabricator.wikimedia.org/T204031) [10:23:34] (03CR) 10Addshore: [C: 03+2] wikidata: post edit constraint jobs on 1% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484624 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:24:09] 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 3 others: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10fgiunchedi) [10:24:41] (03Merged) 10jenkins-bot: wikidata: post edit constraint jobs on 1% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484624 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:25:08] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1077 with low load (duration: 00m 51s) [10:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:04] 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 3 others: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolv... 
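The preferred-replica election logged above can be reproduced with the stock Kafka tooling, roughly as below; the wrapper actually installed on the logging brokers and the ZooKeeper connection string are assumptions, not taken from the log.
    # $ZOOKEEPER_CONNECT is a placeholder for the logging cluster's ZooKeeper string
    kafka-preferred-replica-election.sh --zookeeper "$ZOOKEEPER_CONNECT"
    kafka-topics.sh --zookeeper "$ZOOKEEPER_CONNECT" --describe   # per-partition leaders, to eyeball the balance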
[10:26:34] several db errors on testwikidata, addshore FYI [10:26:41] jynus: y*checking* [10:26:56] https://logstash.wikimedia.org/goto/ea0517f74f8810ac0572c307e8552cc3 [10:27:37] hmm, thats that deadlock *find ticket* [10:27:53] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [10:28:00] https://phabricator.wikimedia.org/T205045 [10:28:09] !log restart rsyslog on wezen, tls listener stuck [10:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:15] (03CR) 10jenkins-bot: wikidata: post edit constraint jobs on 1% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484624 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:29:09] we are going to tweak the batch size next week for that [10:29:09] RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 948 days) [10:29:59] I don't think it is that [10:30:14] that exists, but it is know, etc, mostly on wikidata [10:31:09] my pointer was to testwikidata https://logstash.wikimedia.org/goto/70d5095cbec80ff162c811b9b2ce3f58 [10:31:30] but it was only a heads up as I saw you deploying, and these errors are from mwdebug [10:31:42] ack! [10:32:00] aah, the first logstash link showed me something different, let me look at these [10:32:13] sorry I was unclear [10:33:25] interesting, they all came from mwdebug1002 [10:33:28] (no need to report to me, it was just a friendly "you may have missed those") [10:33:56] and I try to be pedantic if that can cause an outage later [10:33:59] its definitely me that triggered them, but not sure how or why, I'm pretty sure they are not related to the patches I'm deploying right now [10:34:02] thanks for the poke! [10:35:05] its is perhaps just because i flicked the "log" option in the mwdebug browser extension, so we get a bunch of logs that we don't normally see in logstash for the requests that I was making [10:35:27] aaah yes, they are all DEBUG level :) [10:35:34] cool then [10:35:39] thanks! [10:35:47] (03CR) 10Volans: [C: 04-1] "A couple of minor comments inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 (owner: 10Mathew.onipe) [10:38:00] (03PS9) 10ArielGlenn: write header/body/footer of xml gz files as separate streams [dumps] - 10https://gerrit.wikimedia.org/r/484505 (https://phabricator.wikimedia.org/T182572) [10:38:25] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wikidatawiki, wgWBQualityConstraintsEnableConstraintsCheckJobsRatio 1% T204031 [[gerrit:484621]] (duration: 00m 52s) [10:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:28] T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031 [10:38:43] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 0.01 seconds [10:39:50] twentyafterfour: We have failed over dbproxy1003 to dbproxy1008 which is phabricator, I still see two connections (they have been there for hours) going thru dbproxy1003, can you restart phabricator? 
T213865 [10:39:50] T213865: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 [10:40:36] 10Operations, 10DBA, 10Patch-For-Review: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10Marostegui) @mmodell ` ˜/marostegui 11:39> twentyafterfour: We have failed over dbproxy1003 to dbproxy1008 which is phabricator, I still see two connections (they have been there for ho... [10:42:45] (03PS1) 10Addshore: wgWBQualityConstraintsEnableConstraintsCheckJobs true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484648 [10:42:56] 10Operations: Offboard Balazs - https://phabricator.wikimedia.org/T213703 (10jbond) 05Open→03Resolved [10:43:08] (03CR) 10Addshore: [C: 03+2] wgWBQualityConstraintsEnableConstraintsCheckJobs true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484648 (owner: 10Addshore) [10:43:49] (03PS2) 10Addshore: wgWBQualityConstraintsEnableConstraintsCheckJobs true testwd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484648 [10:44:13] (03PS1) 10Addshore: wgWBQualityConstraintsEnableConstraintsCheckJobs true wd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484649 [10:44:32] (03CR) 10Addshore: [C: 03+2] wgWBQualityConstraintsEnableConstraintsCheckJobs true testwd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484648 (owner: 10Addshore) [10:45:36] (03Merged) 10jenkins-bot: wgWBQualityConstraintsEnableConstraintsCheckJobs true testwd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484648 (owner: 10Addshore) [10:46:29] 10Operations, 10Wikimedia-Mailing-lists: Let public archives be indexed and archived - https://phabricator.wikimedia.org/T90407 (10Nemo_bis) > There was a large discussion on irc spanning wikimedia-mailman and -ops that boils down to no one is comfortable or feels it is good practice to index lists that explic... [10:48:21] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wgWBQualityConstraintsEnableConstraintsCheckJobs true testwd (duration: 00m 52s) [10:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:07] (03CR) 10Addshore: [C: 03+2] wgWBQualityConstraintsEnableConstraintsCheckJobs true wd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484649 (owner: 10Addshore) [10:51:43] (03Merged) 10jenkins-bot: wgWBQualityConstraintsEnableConstraintsCheckJobs true wd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484649 (owner: 10Addshore) [10:52:09] twentyafterfour: nevermind my previous comment, I have killed them [10:52:09] 10Operations, 10DBA, 10Patch-For-Review: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10Marostegui) >>! In T213865#4883806, @Marostegui wrote: > @mmodell > ` > ˜/marostegui 11:39> twentyafterfour: We have failed over dbproxy1003 to dbproxy1008 which is phabricator, I still... 
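Spotting the two lingering connections mentioned above comes down to checking what is still established through the old proxy and matching it on the database side. A rough sketch, assuming shell access on dbproxy1003 and a MySQL-port listener, neither of which is stated in the log:
    # on the old proxy: established client connections still terminating there (port is an assumption)
    ss -tn state established '( sport = :3306 )'
    # on the m3 master: identify the offending sessions by client address, then KILL <id> as needed
    mysql -e "SELECT id, user, host, time FROM information_schema.processlist ORDER BY time DESC;"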
[10:53:28] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wgWBQualityConstraintsEnableConstraintsCheckJobs true wd (duration: 00m 52s) [10:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:38] (03CR) 10jenkins-bot: wgWBQualityConstraintsEnableConstraintsCheckJobs true testwd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484648 (owner: 10Addshore) [10:54:40] (03CR) 10jenkins-bot: wgWBQualityConstraintsEnableConstraintsCheckJobs true wd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484649 (owner: 10Addshore) [10:56:13] (03CR) 10Volans: [C: 03+1] "> Patch Set 1:" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484430 (https://phabricator.wikimedia.org/T207845) (owner: 10Jbond) [10:57:14] (03PS1) 10Addshore: wgWBQualityConstraintsEnableConstraintsCheckJobs false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484652 (https://phabricator.wikimedia.org/T204031) [10:57:29] (03CR) 10Addshore: [C: 03+2] wgWBQualityConstraintsEnableConstraintsCheckJobs false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484652 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:58:35] (03Merged) 10jenkins-bot: wgWBQualityConstraintsEnableConstraintsCheckJobs false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484652 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:59:45] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wgWBQualityConstraintsEnableConstraintsCheckJobs false (duration: 00m 51s) [10:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:52] !log slot done [10:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:07] PROBLEM - DPKG on analytics1051 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:00:19] PROBLEM - DPKG on analytics1068 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:00:58] (03CR) 10Addshore: [C: 04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484625 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [11:02:53] PROBLEM - puppet last run on analytics1044 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[initramfs-tools] [11:02:57] !log draining kubernetes1001 for maintenance T213859 [11:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:59] T213859: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 [11:03:10] (03PS2) 10Jbond: Ensure debdeploy exits cleanly when called without any arguments [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484430 (https://phabricator.wikimedia.org/T207845) [11:04:57] (03CR) 10Volans: [C: 03+1] "LGTM" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484430 (https://phabricator.wikimedia.org/T207845) (owner: 10Jbond) [11:07:42] ^ analytics/dpkg should recover soon [11:08:15] (03CR) 10jenkins-bot: wgWBQualityConstraintsEnableConstraintsCheckJobs false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484652 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [11:09:59] RECOVERY - DPKG on analytics1051 is OK: All packages OK [11:10:11] RECOVERY - DPKG on analytics1068 is OK: All packages OK [11:12:37] 10Operations, 10Wikimedia-Logstash, 10User-herron: Replace and expand Elasticsearch storage in eqiad and upgrade the cluster from Debian jessie to stretch - https://phabricator.wikimedia.org/T213898 (10fgiunchedi) p:05Triage→03Normal [11:13:17] RECOVERY - puppet last run on analytics1044 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:13:48] 10Operations, 10Wikimedia-Logstash, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10fgiunchedi) p:05Triage→03Normal [11:15:32] 10Operations, 10Wikimedia-Logstash: Deprecate >= 50% of udp2log producers - https://phabricator.wikimedia.org/T205856 (10fgiunchedi) [11:15:37] 10Operations, 10Wikimedia-Logstash, 10User-herron: Increase utilization of application logging pipeline (FY2018-2019 Q3 TEC6) - https://phabricator.wikimedia.org/T213157 (10fgiunchedi) [11:16:08] 10Operations, 10Wikimedia-Logstash: Deprecate >= 50% of udp2log producers - https://phabricator.wikimedia.org/T205856 (10fgiunchedi) [11:16:11] (03CR) 10Muehlenhoff: [C: 03+1] Ensure debdeploy exits cleanly when called without any arguments [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484430 (https://phabricator.wikimedia.org/T207845) (owner: 10Jbond) [11:16:12] 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 3 others: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10fgiunchedi) [11:16:34] (03PS4) 10Mathew.onipe: elasticsearch_cluster: change is_green() implementation [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 [11:16:48] (03CR) 10Mathew.onipe: elasticsearch_cluster: change is_green() implementation (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 (owner: 10Mathew.onipe) [11:17:55] 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 3 others: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10fgiunchedi) [11:17:56] 10Operations, 10Wikimedia-Logstash: Investigate approaches to ingest sensitive log producers - https://phabricator.wikimedia.org/T205855 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi 
One of these approaches will be implemented as part of stretch goals of {T213157} [11:19:07] 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 3 others: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10fgiunchedi) [11:19:12] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This was completed, there will be more followup while deprecat... [11:22:16] (03CR) 10Jbond: [V: 03+2] Ensure debdeploy exits cleanly when called without any arguments [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484430 (https://phabricator.wikimedia.org/T207845) (owner: 10Jbond) [11:22:22] 10Operations, 10Wikimedia-Logstash: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (10fgiunchedi) [11:22:33] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10elukey) Ok for me for the analytics nodes, but I'd need a bit of heads up to properly stop them if possible :) The Thursday time window proposal is fine for me! [11:23:21] (03CR) 10Jbond: [V: 03+2 C: 03+2] Ensure debdeploy exits cleanly when called without any arguments [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484430 (https://phabricator.wikimedia.org/T207845) (owner: 10Jbond) [11:23:37] 10Operations, 10Wikimedia-Logstash, 10User-herron: Increase utilization of application logging pipeline (FY2018-2019 Q3 TEC6) - https://phabricator.wikimedia.org/T213157 (10fgiunchedi) [11:23:39] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Logstash hardware expansion - https://phabricator.wikimedia.org/T203169 (10fgiunchedi) [11:24:54] 10Operations, 10Wikimedia-Logstash, 10User-herron: [stretch] Implement sensitive log access control, onboard 3 sensitive log producers - https://phabricator.wikimedia.org/T213902 (10fgiunchedi) p:05Triage→03Normal [11:25:31] (03CR) 10Vgutierrez: [C: 03+1] cookbooks.sre.hosts: improve upgrade-and-reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/484422 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans) [11:28:40] (03CR) 10Volans: [C: 03+1] "LGTM, very minor nitpick inline." 
(031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 (owner: 10Mathew.onipe) [11:28:56] (03CR) 10Volans: [C: 03+2] cookbooks.sre.hosts: improve upgrade-and-reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/484422 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans) [11:30:43] (03Merged) 10jenkins-bot: cookbooks.sre.hosts: improve upgrade-and-reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/484422 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans) [11:32:54] 10Operations, 10Wikimedia-Logstash, 10User-herron: Increase utilization of application logging pipeline (FY2018-2019 Q3 TEC6) - https://phabricator.wikimedia.org/T213157 (10fgiunchedi) [11:32:57] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Logstash hardware expansion - https://phabricator.wikimedia.org/T203169 (10fgiunchedi) [11:33:02] (03PS7) 10Volans: sre.hosts: add varnish upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/480103 (https://phabricator.wikimedia.org/T205886) (owner: 10Ema) [11:33:37] 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 3 others: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10fgiunchedi) [11:33:39] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Logstash hardware expansion - https://phabricator.wikimedia.org/T203169 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi codfw is done, resolving in favor of {T213898} [11:43:40] 10Operations, 10monitoring, 10Graphite, 10MW-1.27-release (WMF-deploy-2016-04-12_(1.27.0-wmf.21)), 10MW-1.27-release-notes: UDP rcvbuferrors and inerrors on graphite hosts - https://phabricator.wikimedia.org/T101141 (10fgiunchedi) [11:44:37] (03CR) 10Vgutierrez: [C: 04-1] sre.hosts: add varnish upgrade cookbook (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/480103 (https://phabricator.wikimedia.org/T205886) (owner: 10Ema) [11:52:09] (03PS5) 10Mathew.onipe: elasticsearch_cluster: change is_green() implementation [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 [11:52:29] (03CR) 10Mathew.onipe: elasticsearch_cluster: change is_green() implementation (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 (owner: 10Mathew.onipe) [11:52:50] \o zeljkof [11:53:22] i have 2 backports in the swat (in 8 mins) im gonna hit +2 on them now, as the CI will likely take a pretty long time :) they will be the last 2 patches to get deployed [11:53:37] that is, if your around for swat, otherwise im talking to the wrong person! :D [11:53:45] addshore: ok [11:54:18] [= [11:57:08] 10Operations, 10ops-eqiad, 10Patch-For-Review: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 (10jcrespo) a:05Cmjohnson→03jcrespo Taking care of it. [11:59:00] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 (owner: 10Mathew.onipe) [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1200). [12:00:04] Thiemo_WMDE, davidwbarratt, dcausse, and addshore: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:10] \o [12:00:15] o/ [12:00:20] here! [12:00:25] * addshore can deploy deploy his 2 patches right at the end [= [12:00:31] o/ [12:00:51] 10Operations, 10ops-eqiad, 10Patch-For-Review: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 (10Cmjohnson) While the server was down I updated, BIOS, raid firmware and hardware firmware to the latest updates [12:00:58] Thiemo_WMDE, davidwbarratt, dcausse, and addshore: the last two are deployers, right? the first two? ;) [12:01:06] I can go last, mine will take some time to test [12:01:14] (I can SWAT today for people that are not deployers) [12:01:37] ok, dcausse and addshore, I'll let you know when I'm done, so the two of you self-organize [12:01:40] thanks! [12:01:55] Thiemo_WMDE, davidwbarratt: you're not deployers? [12:01:58] 10Operations, 10DBA, 10Patch-For-Review: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10jcrespo) So this is solved? [12:02:16] zeljkof correct! [12:02:41] davidwbarratt: ok, you're first then, looks like Thiemo_WMDE is not around [12:02:55] CFisch_WMDE: ^^ [12:03:02] I'll let you know when your patch is at mwdebug1002, ready for testing, in a few minutes, let me know if you need help with testing there [12:03:06] (there are docs) [12:03:28] zeljkof: [12:03:31] oki eodkie [12:03:32] go for it [12:03:47] * Thiemo_WMDE is here. [12:03:49] ha, mangled that [12:05:25] (03PS3) 10Zfilipin: Enable Partial Blocks on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476884 (https://phabricator.wikimedia.org/T210444) (owner: 10Dmaza) [12:05:36] (03CR) 10Giuseppe Lavagetto: profile::services_proxy: simple local proxying for remote services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) (owner: 10Giuseppe Lavagetto) [12:06:20] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476884 (https://phabricator.wikimedia.org/T210444) (owner: 10Dmaza) [12:07:27] (03Merged) 10jenkins-bot: Enable Partial Blocks on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476884 (https://phabricator.wikimedia.org/T210444) (owner: 10Dmaza) [12:08:20] davidwbarratt: 476884 is at mwdebug1002, please test and let me know if I can deploy it [12:08:45] testing [12:10:07] looks beautiful [12:11:01] davidwbarratt: ok to deploy? [12:11:16] zeljkof yes. :) thank you! [12:12:02] ok, deploying [12:12:27] !log upgrade and restart db1095 [12:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:52] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:476884|Enable Partial Blocks on itwiki (T210444)]] (duration: 00m 53s) [12:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:55] T210444: SWAT deploy PartialBlocks on Italian Wikipedia — Jan 16, 2019 - https://phabricator.wikimedia.org/T210444 [12:13:24] davidwbarratt: it's deployed, please test and thanks for deploying with #releng :) [12:13:42] it looks amazing! thanks! [12:13:42] Thiemo_WMDE: please stand by, your're next [12:13:50] (Gonna test it too, is it on mwdebug or already deployed?) 
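Testing "at mwdebug1002" means routing your own requests to that debug backend before the change is synced everywhere, either with the WikimediaDebug browser extension mentioned earlier or with the X-Wikimedia-Debug header. The header value below is an assumption about the accepted format, and the URL is just the itwiki test page linked in the next messages:
    # assumed header format for sending a single request to the debug backend
    curl -s -o /dev/null -w '%{http_code}\n' \
        -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' \
        'https://it.wikipedia.org/wiki/Speciale:Blocca'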
[12:13:50] (03PS3) 10Zfilipin: Deploy the FileExporter as a beta feature on all Wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484513 (https://phabricator.wikimedia.org/T213425) (owner: 10WMDE-Fisch) [12:14:00] davidwbarratt: already deployed [12:14:01] (03CR) 10jenkins-bot: Enable Partial Blocks on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476884 (https://phabricator.wikimedia.org/T210444) (owner: 10Dmaza) [12:14:12] Daimona: sorry, it was for you :) [12:14:15] ^ [12:14:20] zeljkof Alright, thanks :) [12:14:31] you can test here if you have permissions: https://it.wikipedia.org/wiki/Speciale:Blocca [12:14:48] note... there is a missing translation, but that word "Sitewide" has been in the repo for... over a month. [12:15:32] ? [12:15:34] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484513 (https://phabricator.wikimedia.org/T213425) (owner: 10WMDE-Fisch) [12:15:46] \o/ [12:15:56] Already tested, seems to work and nothing suspicious on logstash [12:16:06] Daimona YAY! thanks! [12:16:08] Just some messages which need to be translated, I'll do that later [12:16:15] Thiemo_WMDE: just wanted to make sure you're around, your commit will be deployed to mwdebug soon [12:16:41] (03Merged) 10jenkins-bot: Deploy the FileExporter as a beta feature on all Wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484513 (https://phabricator.wikimedia.org/T213425) (owner: 10WMDE-Fisch) [12:16:42] Oh swat [12:17:18] Reedy: you have to say it with desperation in voice ;) [12:17:18] zeljkof: Can you do https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/484523/ please? [12:17:27] Daimona thank you! [12:17:36] Reedy: I was asked if we have some estimate as to when nap.wikisource could be created [12:17:44] but I couldn't answer [12:17:51] Reedy: sure, but I'm almost done, you can do it yourself in a few minutes, if you prefer [12:17:57] I don't mind [12:18:06] Hauskatze: mañana (tomorrow) [12:18:17] Reedy: vale (ok) [12:18:36] Thiemo_WMDE: 484513 is at mwdebug1002, please test and let me know if I can deploy it cc CFisch_WMDE [12:18:48] Reedy: then please let me know if there are missing patches or something, or will you take care of them? [12:19:09] Reedy: please add your commit to the calendar [12:19:15] Reedy: do this? > addshore@deploy1001:/srv/mediawiki-staging/php-1.33.0-wmf.12$ git log HEAD..origin/wmf/1.33.0-wmf.12 :( [12:19:41] Hauskatze: As long as DNS/apache type commits are done so I can JFDI to create the wiki... [12:19:43] zeljkof: Done, confirmed. [12:19:48] Hauskatze: The other one is I think addWiki is broken still [12:19:48] davidwbarratt no prob :) [12:19:58] Thiemo_WMDE: ok to deploy? [12:20:07] Reedy: I think DNS was done and no Apache needed as nap.wikipedia already exists [12:20:10] zeljkof: Ok. [12:20:18] ok, deploying [12:20:19] let me browse the checklist though [12:20:51] addshore: wut? [12:21:08] there is what looks like a massive centralnotice commit there that I wasn't expecting?
[12:21:18] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:484513|Deploy the FileExporter as a beta feature on all Wikimedia wikis (T213425)]] (duration: 00m 53s) [12:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:21] RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 40.72 seconds [12:21:22] T213425: Deploy the FileExporter as a beta feature on all Wikimedia wikis - https://phabricator.wikimedia.org/T213425 [12:21:23] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 35.21 seconds [12:21:26] *goes to find it on gerrit* [12:21:29] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 27.11 seconds [12:21:37] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 1.16 seconds [12:21:39] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 0.27 seconds [12:21:43] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 0.02 seconds [12:21:53] RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [12:21:56] Thiemo_WMDE: it's deployed, please test and thanks for deploying with #releng :) [12:22:07] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 0.31 seconds [12:22:11] AndyRussG: I think the CentralNotice commit I'm seeing is yours? [12:22:17] (03PS4) 10Reedy: frwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484523 (https://phabricator.wikimedia.org/T213848) [12:22:28] addshore: Is it their autobumping autotracking stuff? [12:22:32] Reedy, dcausse, addshore: I'm done, go ahead with your patches, please self organize :) [12:22:40] zeljkof: thanks [12:22:43] (03CR) 10Reedy: [C: 03+2] frwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484523 (https://phabricator.wikimedia.org/T213848) (owner: 10Reedy) [12:22:50] zeljkof: Works as expected on live system. Thanks! [12:23:02] Reedy: not sure, the commit message is massive.. 
i wish these commits would stop appearing [12:23:46] (03Merged) 10jenkins-bot: frwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484523 (https://phabricator.wikimedia.org/T213848) (owner: 10Reedy) [12:24:04] Reedy: the only thing I know what to do now, is yell in here about it :P [12:25:03] 10Operations, 10DBA, 10Patch-For-Review: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10Marostegui) Yes [12:25:11] !log reedy@deploy1001 Synchronized wmf-config/throttle.php: T213848 (duration: 00m 53s) [12:25:11] RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 56.29 seconds [12:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:13] RECOVERY - MariaDB Slave Lag: s4 on db2091 is OK: OK slave_sql_lag Replication lag: 53.15 seconds [12:25:13] T213848: Requesting temporary lift of IP cap on fr.wikipedia.org - https://phabricator.wikimedia.org/T213848 [12:25:21] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10Marostegui) [12:25:26] 10Operations, 10DBA, 10Patch-For-Review: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10Marostegui) 05Open→03Resolved [12:25:31] RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 51.32 seconds [12:25:31] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 49.06 seconds [12:25:45] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 44.22 seconds [12:25:45] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 0.13 seconds [12:25:45] RECOVERY - MariaDB Slave Lag: s4 on db2095 is OK: OK slave_sql_lag Replication lag: 44.27 seconds [12:25:55] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 40.27 seconds [12:26:09] RECOVERY - MariaDB Slave Lag: s4 on db2084 is OK: OK slave_sql_lag Replication lag: 37.63 seconds [12:26:29] I guess it is tracked in https://phabricator.wikimedia.org/T179536 [12:26:54] (03CR) 10jenkins-bot: Deploy the FileExporter as a beta feature on all Wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484513 (https://phabricator.wikimedia.org/T213425) (owner: 10WMDE-Fisch) [12:26:56] (03CR) 10jenkins-bot: frwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484523 (https://phabricator.wikimedia.org/T213848) (owner: 10Reedy) [12:27:08] addshore: is it https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/484352 ? [12:27:32] dcausse: thats the one [12:27:41] Looks like they've merged it for .13 [12:27:43] But forgot about .12 [12:27:58] revert perhaps? this is not supposed to be merged without being deployed [12:28:01] Nope [12:28:09] well, CN has a curious way to deploy stuff :) [12:28:10] well, i guess it is 1/2 deployed [12:28:10] Because if we revert it, we change .13 too [12:28:29] I don't particularly want to deploy it for .12 myself [12:28:39] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:28:41] don't update the submodule then? [12:28:50] and/or don't sync the whole extensions tree [12:29:17] well, when I do my "git rebase" im just gonna end up hiding the change right? 
[12:29:28] yes, we'll be out of sync, someone will have to take care of it [12:29:33] and then if someone ends up doing a full sync they may be syncing it without knowing [12:29:35] Do you actually need to rebase anything? [12:29:41] Well, no they won't [12:29:49] unless you do git submodule update extensions/CentralNotice [12:29:52] the submodule should be visible? [12:29:53] The staged code isn't going to change [12:31:25] Reedy: well, i still need to get my change (which is currently on top of the CN one) into the actual tree [12:31:31] Right [12:31:50] don't update the CN git submodule [12:31:51] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:31:53] It's easy to avoid [12:32:04] do tell :) [12:32:19] i generally avoid doing things while deploying unless they are written down somewhere ;) [12:32:21] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:32:24] don't run git submodule update --init --recursive [12:32:32] just do git submodule update extensions/Whichever [12:32:38] thats all i normally do [12:32:55] but in the current state, that won't even update my submodule right? WikibaseQualityConstraints ? [12:33:09] unless i do something with my currently only fetched commit [12:33:48] or am I misunderstanding that and the submodule update would actually update it? (even without the rebase or something else) [12:33:57] If you explicitly tell it to update your submodule... after git pull/fetch/rebase/whatever [12:33:59] It'll update it [12:34:02] It won't update CN [12:34:11] addshore: just git rebase, the CN patch will be seen in git status because you'll just run git submodule update WikibaseQualityConstraints [12:34:33] (03PS1) 10Jbond: update changelog and add gitignore file [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484661 [12:35:07] so, after the fetch Reedy the submodule update did not update the code [12:35:14] it won't [12:35:19] because your commit isn't there [12:35:29] fetch only does magic things in the background [12:35:33] it doesn't change local HEAD [12:36:23] (03CR) 10Muehlenhoff: update changelog and add gitignore file (031 comment) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484661 (owner: 10Jbond) [12:36:47] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:37:05] dcausse: one [12:37:36] ? [12:37:47] modified: extensions/CentralNotice (new commits) [12:37:49] "just git rebase, the CN patch will be seen in git status because you'll just run git submodule update WikibaseQualityConstraints" [12:37:53] Right [12:38:05] right, going to sync them [12:38:07] with git status showing ^ people know something isn't right and shouldn't touch it [12:38:49] (03PS2) 10Jbond: update changelog and add gitignore file [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484661 [12:39:09] Reedy: ack [12:39:23] urgf, i hate that silly CN thing [12:39:50] (03PS3) 10Jbond: update changelog and add gitignore file [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484661 [12:39:56] addshore Reedy sorry!!!
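The upshot of the submodule discussion above, as a sketch against the staging checkout. Branch and extension names come from the conversation; the final sync command is an assumption about which scap subcommand produced the "Synchronized ..." entries that follow:
    cd /srv/mediawiki-staging/php-1.33.0-wmf.12
    git fetch && git rebase                                       # bring the branch checkout up to date
    git submodule update extensions/WikibaseQualityConstraints    # bump only the wanted extension
    git status    # extensions/CentralNotice stays "modified (new commits)" on purpose, as a marker
    cd /srv/mediawiki-staging
    scap sync-dir php-1.33.0-wmf.12/extensions/WikibaseQualityConstraints 'Fix constraintsRunCheck Job class'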
[12:39:58] !log addshore@deploy1001 Synchronized php-1.33.0-wmf.13/extensions/WikibaseQualityConstraints: [[gerrit:484654]] T204031 T204022 Fix constraintsRunCheck Job class & test (duration: 00m 57s) [12:40:01] Aha [12:40:01] I guess we should ask them to use standard wmf branches [12:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:02] T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031 [12:40:03] T204022: Add functionality to run QualityConstraint checks on an entity after every edit - https://phabricator.wikimedia.org/T204022 [12:40:08] Is this the train? [12:40:09] (03CR) 10Muehlenhoff: [C: 03+1] update changelog and add gitignore file [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484661 (owner: 10Jbond) [12:40:09] Hauskatze: They don't for various reasons [12:40:20] Reedy: I know, but it causes issues right? [12:40:28] AndyRussG: the bump went out with the train (.13), which is fine [12:40:36] The problem is .12 wasn't deployed too [12:40:41] I just filed https://phabricator.wikimedia.org/T213915 [12:40:44] I'm in crazy kid school morning land [12:40:55] (03CR) 10Jbond: [C: 03+2] update changelog and add gitignore file [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484661 (owner: 10Jbond) [12:40:57] Hauskatze: But they don't want the branches for deployment tracking master [12:40:59] !log addshore@deploy1001 Synchronized php-1.33.0-wmf.12/extensions/WikibaseQualityConstraints: [[gerrit:484654]] T204031 T204022 Fix constraintsRunCheck Job class & test (duration: 00m 54s) [12:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:07] right, thats my 2 done! [12:41:19] Reedy: you're done, can I go next? [12:41:19] AndyRussG: No major rush, but obviously we don't want to just deploy it to a different MW branch unguided [12:41:26] dcausse: Yeah, sure [12:41:34] swating my changes [12:42:24] Well we should get such branches, we'll work on better deploy sooooon [12:42:32] (03PS23) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) [12:42:34] (03PS25) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [12:43:15] Reedy ok thx! Back at the keyboard pretty sooon [12:44:07] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [12:45:10] (03Merged) 10jenkins-bot: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [12:48:45] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:49:07] sigh... 
[12:49:27] RECOVERY - Memory correctable errors -EDAC- on thumbor1004 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [12:49:35] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:49:37] PROBLEM - Nginx local proxy to apache on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:49:42] (03PS9) 10MarcoAurelio: mediawiki: Stop logging each run of purge_abusefilter.pp [puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) [12:49:53] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 75754 bytes in 3.304 second response time [12:52:03] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 623 bytes in 9.346 second response time [12:52:03] RECOVERY - Nginx local proxy to apache on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 624 bytes in 7.599 second response time [12:52:36] (03CR) 10jenkins-bot: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [12:54:35] 10Operations, 10DBA, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) Just to confirm: Date: Thursday 17th January Time: 07:00 AM UTC - 07:30 AM UTC (we expect not to use the full 30 minutes window) **Impact: All those wikis will go read-o... [12:54:47] 10Operations, 10DBA, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) [12:57:45] dcausse: what is the status of swat? We have some pending maintenance that cannot wait much (blocking emergency master switch= [12:58:08] jynus: ok [12:58:15] I'm not done yet I need to revert [12:58:19] we don't need exclusivity [12:58:27] but we need to keep deploying stuff [12:58:43] jynus: it's only db-*.php files? [12:58:46] yes [12:58:50] ok [12:58:51] as usual :-) [12:59:03] please go ahead [12:59:06] do I have your permission to do that? [12:59:07] thanks! [12:59:12] normally I wait [12:59:16] I'll send revert patch soon, just debugging a bit more on mwdebug1002 [12:59:20] but we are a bit on a schedule [12:59:28] np! I understand [12:59:29] marostegui: ^ [12:59:44] lets repool db1077 [12:59:51] fully, then depool 78 [12:59:54] sounds good [12:59:58] and db1123 for later [13:00:01] great [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1300) [13:00:38] prepare the patch for 78 and all the other work while I do the repool [13:01:09] oki! [13:01:27] it is a relatively large shift of load, so let's do it slowly [13:01:42] 20K QPS from one host to other [13:02:21] (03PS3) 10Jcrespo: Revert "mariadb: Depool db1077 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484640 [13:02:53] (03PS1) 10Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484663 (https://phabricator.wikimedia.org/T209815) [13:03:03] I woudl suggest to increas db1123 load [13:03:43] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10BBlack) >>! 
In T203194#4880054, @Vgutierrez wrote: > on the Dell community forum there is a [[ https://www.dell.com/community/PowerEdge-Hardware-General/Critical-netwo... [13:03:45] let's give it 250? [13:03:52] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484663 (https://phabricator.wikimedia.org/T209815) (owner: 10Marostegui) [13:03:53] ok [13:04:08] but wait to rebase on top of my revert [13:04:12] yep, ofc [13:04:15] RECOVERY - HP RAID on ms-be1034 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [13:04:19] (03PS2) 10Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484663 (https://phabricator.wikimedia.org/T209815) [13:04:22] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Depool db1077 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484640 (owner: 10Jcrespo) [13:04:31] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484663 (https://phabricator.wikimedia.org/T209815) (owner: 10Marostegui) [13:05:55] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1077 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484640 (owner: 10Jcrespo) [13:06:08] you will need to rebase manually [13:06:17] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1077 fully (duration: 00m 52s) [13:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:41] yeah, I was preparing a different patchset, which is easier than fixing the conflicts manually XDDD (/me being lazy) [13:07:09] https://xkcd.com/1597/ ? [13:07:13] XD [13:07:16] exactly XDDDD [13:07:56] (03PS1) 10DCausse: Revert "[cirrus] Start using replica group settings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484665 [13:08:05] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.87 seconds [13:08:11] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.65 seconds [13:08:15] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.54 seconds [13:08:17] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.35 seconds [13:08:27] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.28 seconds [13:08:37] (03Abandoned) 10Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484663 (https://phabricator.wikimedia.org/T209815) (owner: 10Marostegui) [13:08:39] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.73 seconds [13:08:41] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.11 seconds [13:08:58] (03PS1) 10Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484666 (https://phabricator.wikimedia.org/T209815) [13:08:59] jynus: ^ [13:09:01] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10BBlack) See also this email thread where Michael Chan (broadcom driver dev) asks for firmware level output, sees the same numbers we have on cp1088, and tells them to... 
[13:09:07] looking [13:09:09] PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.63 seconds [13:09:11] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.21 seconds [13:09:42] you need to shift the rc and log traffic [13:09:54] what do you mean? [13:09:55] (03CR) 10Jcrespo: [C: 04-1] "IRC" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484666 (https://phabricator.wikimedia.org/T209815) (owner: 10Marostegui) [13:10:11] jynus, marostegui: I need to rebase deploy1001 with this https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/484665, lemme when it's a good time [13:10:12] doesn't it have section loads? [13:10:23] jynus: db1078 no [13:10:24] dcausse: go on, we are reviewing [13:10:27] ok [13:10:49] marostegui: let me double check to see what I changed before was right too [13:10:55] (03CR) 10DCausse: [C: 03+2] "SWAT, revert previous patch (testing failed on mwdebug1002)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484665 (owner: 10DCausse) [13:11:03] jynus: db1078 has never had rc or vslow traffic [13:11:27] oh, I see [13:11:38] it had only during my maintenance [13:11:45] but it is now ok, sorry [13:11:45] :) [13:12:02] (03Merged) 10jenkins-bot: Revert "[cirrus] Start using replica group settings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484665 (owner: 10DCausse) [13:12:06] (03CR) 10Jcrespo: [C: 03+1] "Sorry" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484666 (https://phabricator.wikimedia.org/T209815) (owner: 10Marostegui) [13:12:09] thanks! [13:12:16] dcausse: let me know when you are done, so I can go :) [13:12:24] as I said, and you told me so, I prefer to be pedantic [13:12:33] !log eu SWAT done [13:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:39] jynus: and i like it! :) [13:12:40] if I don't have it 100% clear [13:12:41] marostegui: all done [13:12:44] thanks! [13:12:47] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484666 (https://phabricator.wikimedia.org/T209815) (owner: 10Marostegui) [13:12:55] dcausse: thanks and sorry for the pressure [13:13:04] np! [13:13:50] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484666 (https://phabricator.wikimedia.org/T209815) (owner: 10Marostegui) [13:13:56] I will reserve the windown on deployments [13:14:13] thanks [13:15:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1078 T209815 (duration: 00m 52s) [13:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:05] T209815: Upgrade firmware on db1078 - https://phabricator.wikimedia.org/T209815 [13:15:08] !log Stop MySQL on db1078 and power it off for firmware update - T209815 [13:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:39] (03PS2) 10Arturo Borrero Gonzalez: apt: repository: trust also the source repo [puppet] - 10https://gerrit.wikimedia.org/r/483140 [13:16:55] moritzm: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/483140/ regarding this, good to merge? 
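The depool-for-maintenance cycle the DBAs are coordinating above, as a hedged sketch (the file path and ticket are from the log; the load-weight layout, the scap subcommand, and the mariadb unit name are assumptions):

    # on deploy1001, after the db-eqiad.php change has been merged
    cd /srv/mediawiki-staging
    git pull                    # e.g. drops/zeroes db1078 in the s3 'sectionLoads' map
    scap sync-file wmf-config/db-eqiad.php 'Depool db1078 T209815'

    # on db1078 itself, once traffic has drained
    sudo systemctl stop mariadb
    sudo poweroff               # ready for the firmware update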
[13:17:58] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Increase utilization of application logging pipeline (FY2018-2019 Q3 TEC6) - https://phabricator.wikimedia.org/T213157 (10fgiunchedi) [13:18:05] (03PS1) 10Alexandros Kosiaris: WIP: Remove externalIP settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/484670 [13:18:50] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal) - https://phabricator.wikimedia.org/T213288 (10fgiunchedi) [13:18:52] (03CR) 10jenkins-bot: Revert "[cirrus] Start using replica group settings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484665 (owner: 10DCausse) [13:18:55] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484666 (https://phabricator.wikimedia.org/T209815) (owner: 10Marostegui) [13:20:04] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.71 seconds [13:24:02] arturo: looking [13:26:41] (03CR) 10Muehlenhoff: apt: repository: trust also the source repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483140 (owner: 10Arturo Borrero Gonzalez) [13:28:14] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.69 seconds [13:28:18] PROBLEM - MariaDB Slave Lag: s8 on db2081 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.28 seconds [13:28:34] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.57 seconds [13:28:38] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.37 seconds [13:28:39] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: [stretch] Investigate distributed and long term storage solutions for Prometheus - https://phabricator.wikimedia.org/T213918 (10fgiunchedi) p:05Triage→03Normal [13:28:40] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.23 seconds [13:28:44] PROBLEM - MariaDB Slave Lag: s8 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.97 seconds [13:28:48] PROBLEM - MariaDB Slave Lag: s8 on db2080 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.62 seconds [13:28:56] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.60 seconds [13:28:56] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.67 seconds [13:28:56] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.71 seconds [13:28:56] PROBLEM - MariaDB Slave Lag: s8 on db2083 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.21 seconds [13:29:08] PROBLEM - MariaDB Slave Lag: s8 on db2079 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.24 seconds [13:29:11] (03PS1) 10Arturo Borrero Gonzalez: toolforge: services: sync aptly repo [puppet] - 10https://gerrit.wikimedia.org/r/484671 (https://phabricator.wikimedia.org/T213917) [13:30:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: services: sync aptly repo [puppet] - 10https://gerrit.wikimedia.org/r/484671 (https://phabricator.wikimedia.org/T213917) (owner: 10Arturo Borrero Gonzalez) [13:31:42] (03CR) 10Gehel: [V: 03+2 C: 03+2] This script has been moved to the puppet repository. 
[debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/484641 (owner: 10Gehel) [13:34:09] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Expand modern metrics infrastructure coverage (2018-19 Q2 goal) - https://phabricator.wikimedia.org/T205862 (10fgiunchedi) [13:36:00] PROBLEM - MariaDB Slave Lag: s8 on db2085 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 432.90 seconds [13:37:28] (03PS12) 10Gehel: Elasticsearch failed shard allocation check [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [13:40:09] (03CR) 10Gehel: [C: 03+2] Elasticsearch failed shard allocation check [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [13:43:32] (03CR) 10Gehel: "almost good! one missing piece: maps1003 is still tagged as jessie in modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200" [puppet] - 10https://gerrit.wikimedia.org/r/483798 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [13:43:58] PROBLEM - MariaDB Slave Lag: s8 on db2082 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 546.24 seconds [13:44:13] (03PS1) 10Jcrespo: mariadb: Repool es1019 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484673 (https://phabricator.wikimedia.org/T213422) [13:45:54] PROBLEM - MariaDB Slave Lag: s8 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 573.31 seconds [13:47:16] (03CR) 10Marostegui: [C: 03+1] mariadb: Repool es1019 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484673 (https://phabricator.wikimedia.org/T213422) (owner: 10Jcrespo) [13:48:11] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Create Icinga check for failed shard allocation - https://phabricator.wikimedia.org/T212850 (10Gehel) This check has been deployed for the main cirrus clusters (eqiad+codfw). We still need to add it for : * psi / omega... 
[13:49:13] PROBLEM - MariaDB Slave Lag: s8 on db2045 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 616.83 seconds [13:53:52] (03CR) 10Jcrespo: [C: 03+2] mariadb: Repool es1019 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484673 (https://phabricator.wikimedia.org/T213422) (owner: 10Jcrespo) [13:54:57] (03Merged) 10jenkins-bot: mariadb: Repool es1019 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484673 (https://phabricator.wikimedia.org/T213422) (owner: 10Jcrespo) [13:55:20] (03PS2) 10Marostegui: mariadb: Promote db1078 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/484612 (https://phabricator.wikimedia.org/T213858) [13:55:40] (03PS2) 10Marostegui: db-eqiad.php: Put s3 on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484613 (https://phabricator.wikimedia.org/T213858) [13:56:11] RECOVERY - MariaDB Slave Lag: s6 on db2087 is OK: OK slave_sql_lag Replication lag: 59.53 seconds [13:56:27] RECOVERY - MariaDB Slave Lag: s6 on db2053 is OK: OK slave_sql_lag Replication lag: 56.06 seconds [13:56:29] RECOVERY - MariaDB Slave Lag: s6 on db2089 is OK: OK slave_sql_lag Replication lag: 55.56 seconds [13:56:33] RECOVERY - MariaDB Slave Lag: s6 on db2039 is OK: OK slave_sql_lag Replication lag: 55.20 seconds [13:56:43] RECOVERY - MariaDB Slave Lag: s6 on db2095 is OK: OK slave_sql_lag Replication lag: 53.48 seconds [13:56:45] RECOVERY - MariaDB Slave Lag: s6 on db2067 is OK: OK slave_sql_lag Replication lag: 52.19 seconds [13:56:49] RECOVERY - MariaDB Slave Lag: s6 on db2046 is OK: OK slave_sql_lag Replication lag: 50.51 seconds [13:57:03] RECOVERY - MariaDB Slave Lag: s6 on db2076 is OK: OK slave_sql_lag Replication lag: 50.59 seconds [13:57:03] RECOVERY - MariaDB Slave Lag: s6 on db2060 is OK: OK slave_sql_lag Replication lag: 50.71 seconds [13:57:31] RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 57.43 seconds [13:57:37] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 39.60 seconds [13:57:37] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 39.69 seconds [13:57:41] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 28.47 seconds [13:57:45] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 18.88 seconds [13:57:51] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 7.30 seconds [13:57:55] RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 0.23 seconds [13:57:57] (03CR) 10Jcrespo: "Comments" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484613 (https://phabricator.wikimedia.org/T213858) (owner: 10Marostegui) [13:58:05] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [13:58:10] (03CR) 10jenkins-bot: mariadb: Repool es1019 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484673 (https://phabricator.wikimedia.org/T213422) (owner: 10Jcrespo) [13:58:37] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [13:58:48] godog, herron : https://gerrit.wikimedia.org/r/c/operations/puppet/+/482297/ do you think this check will be useful for logstash? 
I can prepare the patch to enable it [14:00:01] (03PS3) 10Marostegui: db-eqiad.php: Set s3 to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484613 (https://phabricator.wikimedia.org/T213858) [14:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1400) [14:01:01] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1030, Errmsg: Error Got error 22 Invalid argument from storage engine TokuDB on query. Default database: outreachwiki. [Query snipped] [14:01:13] elukey: ^ [14:01:26] buuuuu [14:01:45] that is not related to my alters right? [14:02:04] I don't think so [14:02:16] I cannot believe how much time we are spending with host host lately :( [14:02:34] aren't you happy that you can work with me so much!?? [14:02:39] XDDD [14:02:39] * elukey runs away [14:03:03] I am going to conver that table to innodb [14:04:04] great, it is failing for all the tables on that wiki [14:04:09] I am glad it is small [14:04:13] (03PS3) 10Mathew.onipe: maps: migrate maps1003 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/483798 (https://phabricator.wikimedia.org/T198622) [14:04:14] I will fully move it to innodb [14:05:01] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1019 with low load (duration: 00m 52s) [14:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:47] (03PS2) 10Jcrespo: Revert "mariadb: Depool es1019 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483700 [14:10:24] elukey: I have fixed dbstore1002 [14:10:27] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:10:35] thanks! [14:16:04] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Depool es1019 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483700 (owner: 10Jcrespo) [14:17:27] (03Merged) 10jenkins-bot: Revert "mariadb: Depool es1019 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483700 (owner: 10Jcrespo) [14:18:18] onimisionipe: yes please! definitely sounds useful for logstash too [14:19:03] godog: alright! gimmie some min [14:19:30] (03PS10) 10Ottomata: [WIP] Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) [14:19:50] (03PS11) 10Ottomata: [WIP] Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) [14:20:48] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1019 fully (duration: 00m 52s) [14:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:00] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10fgiunchedi) [14:21:02] 10Operations, 10monitoring: Adapt Kafka dashboards to use metrics from prometheus-node-exporter - https://phabricator.wikimedia.org/T207041 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Done, I've removed or replaced `servers` in the Kafka dashboard! "Kafka (graphite)" still has some but that's expected [14:24:16] (03CR) 10jenkins-bot: Revert "mariadb: Depool es1019 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483700 (owner: 10Jcrespo) [14:29:09] !log stop upgrade db1124 (this may have temp. 
lag on labsdb hosts for s1, s3, s5, s8) [14:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:19] !log otto@deploy1001 Started deploy [analytics/superset/deploy@UNKNOWN]: attempt to deploy 0.26.3-wikimedia1 [14:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:05] PROBLEM - puppet last run on mw1290 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [14:49:43] PROBLEM - MariaDB Slave Lag: s6 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.01 seconds [14:49:51] PROBLEM - MariaDB Slave Lag: s6 on db2067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.85 seconds [14:49:55] PROBLEM - MariaDB Slave Lag: s6 on db2046 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.72 seconds [14:50:09] PROBLEM - MariaDB Slave Lag: s6 on db2076 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.02 seconds [14:50:11] PROBLEM - MariaDB Slave Lag: s6 on db2060 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.93 seconds [14:50:14] (03PS1) 10Mathew.onipe: icinga: enable check for psi and omega cluster [puppet] - 10https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850) [14:50:23] PROBLEM - MariaDB Slave Lag: s6 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.55 seconds [14:50:41] PROBLEM - MariaDB Slave Lag: s6 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 320.96 seconds [14:50:43] PROBLEM - MariaDB Slave Lag: s6 on db2053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.88 seconds [14:50:49] (03CR) 10jerkins-bot: [V: 04-1] icinga: enable check for psi and omega cluster [puppet] - 10https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [14:50:49] PROBLEM - MariaDB Slave Lag: s6 on db2039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 323.89 seconds [14:52:03] ACKNOWLEDGEMENT - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel data transfer in progress [14:52:03] ACKNOWLEDGEMENT - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel data transfer in progress [14:53:01] (03PS2) 10Mathew.onipe: icinga: enable check for psi and omega cluster [puppet] - 10https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850) [14:56:19] !log stop upgrade db1125 (this may cause temp. lag on labsdb hosts for s7, s6, s4, s2) [14:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:15] 10Operations: Reset Wikitech 2FA access for Matthias_Geisler_WMDE - https://phabricator.wikimedia.org/T213928 (10WMDE-leszek) [15:02:09] Reedy: re, https://phabricator.wikimedia.org/T213928 if I verify via a call I can just run the maint script right? [15:03:06] 10Operations: Reset Wikitech 2FA access for Matthias_Geisler_WMDE - https://phabricator.wikimedia.org/T213928 (10Aklapper) > If there is a need for @Matthias_Geisler_WMDE to confirm his identity, please suggest the preferred way to do it. https://wikitech.wikimedia.org/wiki/Password_reset#Reset_two_factor_authe... [15:04:27] addshore: Pretty much. 
Or any other way you can comfortably confirm who they are [15:05:37] 10Operations: Reset Wikitech 2FA access for Matthias_Geisler_WMDE - https://phabricator.wikimedia.org/T213928 (10Addshore) > If you recognize them, have a face-to-face or in a video chat. > If someone on WMF staff recognizes them, have a three-way video chat where a staffmember vouches. > Have the user write a r... [15:06:13] 10Operations, 10User-Addshore: Reset Wikitech 2FA access for Matthias_Geisler_WMDE - https://phabricator.wikimedia.org/T213928 (10CDanis) a:03Addshore [15:06:25] addshore: thanks! assigning it to you just to get it off the ops clinic dashboard :) [15:06:30] ack! [15:08:30] 10Operations, 10User-Addshore: Reset Wikitech 2FA access for Matthias_Geisler_WMDE - https://phabricator.wikimedia.org/T213928 (10Addshore) Confirmed in a call, will go and reset this now [15:09:18] (03PS1) 10Mathew.onipe: icinga: enable check for logstash [puppet] - 10https://gerrit.wikimedia.org/r/484685 (https://phabricator.wikimedia.org/T212850) [15:10:44] 10Operations, 10ops-eqiad, 10DBA: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726 (10jcrespo) 05Open→03Stalled stalling, no errors so far, but I doubt this is the last time we hear abut this. Backups are on dbstore1001 just in case. [15:10:52] 10Operations, 10User-Addshore: Reset Wikitech 2FA access for Matthias_Geisler_WMDE - https://phabricator.wikimedia.org/T213928 (10Addshore) [15:11:30] 10Operations, 10monitoring, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121 (10jcrespo) [15:11:35] 10Operations, 10ops-eqiad, 10Patch-For-Review: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 (10jcrespo) 05Open→03Resolved a:05jcrespo→03Cmjohnson es1019 is back into service. [15:12:03] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Upgrade firmware on db1078 - https://phabricator.wikimedia.org/T209815 (10Cmjohnson) I ran the Service Pack on db1078, all firmware is up to date including BIOS and raid controller. 
The server is currently powered off [15:12:05] !log addshore@mwmaint1002:~$ mwscript extensions/OATHAuth/maintenance/disableOATHAuthForUser.php --wiki=labswiki Matthias_Geisler // T213928 [15:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:08] T213928: Reset Wikitech 2FA access for Matthias_Geisler_WMDE - https://phabricator.wikimedia.org/T213928 [15:12:26] 10Operations, 10User-Addshore: Reset Wikitech 2FA access for Matthias_Geisler_WMDE - https://phabricator.wikimedia.org/T213928 (10Addshore) You should now be able to login and setup 2fa again [15:13:07] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.12 seconds [15:13:09] RECOVERY - puppet last run on mw1290 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:13:09] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.82 seconds [15:13:21] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.69 seconds [15:13:51] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484686 [15:13:53] PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.58 seconds [15:13:57] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.26 seconds [15:14:01] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.63 seconds [15:14:01] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.13 seconds [15:14:07] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.07 seconds [15:14:13] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.96 seconds [15:14:27] (03PS2) 10Mathew.onipe: icinga: enable check for logstash [puppet] - 10https://gerrit.wikimedia.org/r/484685 (https://phabricator.wikimedia.org/T212850) [15:14:49] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484686 [15:20:30] (03PS3) 10Mathew.onipe: icinga: enable check for psi and omega cluster [puppet] - 10https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850) [15:20:35] RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational [15:22:09] godog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/484685 [15:23:09] marostegui: Is it safe for me to restart the actor migration on s3 now? [15:23:40] anomie: no, we are doing the failover tomorrow, so it needs to be stopped till tomorrow, sorry about that [15:24:23] 10Operations, 10User-Addshore: Reset Wikitech 2FA access for Matthias_Geisler_WMDE - https://phabricator.wikimedia.org/T213928 (10Matthias_Geisler_WMDE) 05Open→03Resolved [15:24:51] 10Operations, 10User-Addshore: Reset Wikitech 2FA access for Matthias_Geisler_WMDE - https://phabricator.wikimedia.org/T213928 (10Matthias_Geisler_WMDE) Thanks!!!!! [15:25:12] ok, s3 should still get to complete before s1 finishes even with the extra delay. 
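For the record, the general shape of the 2FA reset performed above (the maintenance script path and wiki id are taken from the log entry itself; it is run from a maintenance host only after the requester's identity has been verified out-of-band, e.g. by call, video chat, or a staff vouch as discussed):

    mwscript extensions/OATHAuth/maintenance/disableOATHAuthForUser.php \
        --wiki=labswiki 'Matthias_Geisler'
    # then !log the action so it ends up in the Server Admin Log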
[15:25:38] anomie: I will ping you tomorrow as soon as we are completely done with the failover so you can resume once you get online [15:26:13] ok [15:26:30] sorry :( [15:26:43] This came kinda unexpectedly [15:26:55] RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 40.17 seconds [15:26:59] RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 33.22 seconds [15:26:59] RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 33.28 seconds [15:27:11] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 1.17 seconds [15:27:31] RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 0.22 seconds [15:27:45] RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [15:27:51] a-team are we doing today the first analytics deployment train? :) [15:28:01] RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 0.52 seconds [15:28:02] the A Train [15:28:39] RECOVERY - Check systemd state on wdqs2006 is OK: OK - running: The system is fully operational [15:29:37] fdans: let's chat in #analytics to see if there is stuff to deploy first [15:29:51] OH SORRY [15:33:26] 10Operations, 10serviceops, 10User-Joe: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Joe) p:05Triage→03Normal [15:33:46] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484686 (owner: 10Marostegui) [15:34:14] (03PS1) 10Gehel: wdqs: fix broken logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/484690 [15:34:53] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484686 (owner: 10Marostegui) [15:35:23] (03PS2) 10Marostegui: db-eqiad.php: Promote db1078 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484614 (https://phabricator.wikimedia.org/T213858) [15:35:54] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1078 T209815 (duration: 00m 52s) [15:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:57] T209815: Upgrade firmware on db1078 - https://phabricator.wikimedia.org/T209815 [15:37:37] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Upgrade firmware on db1078 - https://phabricator.wikimedia.org/T209815 (10Marostegui) 05Open→03Resolved Thank you so much! The server is back in the mix. 
[15:38:11] (03CR) 10DCausse: [C: 03+1] wdqs: fix broken logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/484690 (owner: 10Gehel) [15:39:03] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [15:39:32] (03CR) 10Mathew.onipe: [C: 03+1] wdqs: fix broken logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/484690 (owner: 10Gehel) [15:40:31] (03PS2) 10Jcrespo: mariadb: Depool db1123 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484642 (https://phabricator.wikimedia.org/T213858) [15:41:01] !log "Import new debdeploy 0.0.99.7 packages for stretch T207845 [15:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:04] T207845: debdeploy: show help message if invoked with no arguments - https://phabricator.wikimedia.org/T207845 [15:43:04] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484686 (owner: 10Marostegui) [15:44:14] (03CR) 10Vgutierrez: [C: 03+1] Expand Coordinator.resultUp behavior on first monitor check result [debs/pybal] - 10https://gerrit.wikimedia.org/r/478203 (owner: 10Mark Bergsma) [15:50:01] (03PS2) 10Kosta Harlan: EditorJourney: Enable data collection for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484289 (https://phabricator.wikimedia.org/T213348) [15:50:21] (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db1123 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484642 (https://phabricator.wikimedia.org/T213858) (owner: 10Jcrespo) [15:51:28] (03Merged) 10jenkins-bot: mariadb: Depool db1123 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484642 (https://phabricator.wikimedia.org/T213858) (owner: 10Jcrespo) [15:55:59] (03CR) 10jenkins-bot: mariadb: Depool db1123 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484642 (https://phabricator.wikimedia.org/T213858) (owner: 10Jcrespo) [15:56:08] !log Import new debdeploy 0.0.99.7 packages for jessie T207845 [15:56:08] (03PS26) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [15:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:11] T207845: debdeploy: show help message if invoked with no arguments - https://phabricator.wikimedia.org/T207845 [15:56:11] (03PS1) 10DCausse: [cirrus] Start using replica group settings (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484693 (https://phabricator.wikimedia.org/T210381) [15:56:25] (03CR) 10Gehel: [C: 03+2] wdqs: fix broken logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/484690 (owner: 10Gehel) [15:57:45] !log otto@deploy1001 Started deploy [analytics/superset/deploy@f73b897]: bump to 0.26.3-wikimedia2 with chart format string fix [15:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:21] !log otto@deploy1001 Finished deploy [analytics/superset/deploy@f73b897]: bump to 0.26.3-wikimedia2 with chart format string fix (duration: 00m 36s) [15:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:55] (03PS1) 10GTirloni: wmcs::nfs::misc - Add nfsd-ldap package back [puppet] - 10https://gerrit.wikimedia.org/r/484694 (https://phabricator.wikimedia.org/T209527) [15:59:11] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1123 (duration: 00m 52s) [15:59:11] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:17] !log Import new debdeploy 0.0.99.7 packages for buster T207845 [15:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:03] (03CR) 10EBernhardson: [cirrus] Start using replica group settings (take 2) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484693 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [16:02:27] (03CR) 10GTirloni: [C: 03+2] wmcs::nfs::misc - Add nfsd-ldap package back [puppet] - 10https://gerrit.wikimedia.org/r/484694 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [16:02:43] (03PS2) 10GTirloni: wmcs::nfs::misc - Add nfsd-ldap package back [puppet] - 10https://gerrit.wikimedia.org/r/484694 (https://phabricator.wikimedia.org/T209527) [16:02:55] !log Import new debdeploy 0.0.99.7 packages for trusty T207845 [16:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:58] T207845: debdeploy: show help message if invoked with no arguments - https://phabricator.wikimedia.org/T207845 [16:08:08] !log upgrade and stop db1123 [16:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:20] (03Abandoned) 10Gehel: [WIP] wdqs: create multiple instances of blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483234 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel) [16:12:06] (03CR) 10Gehel: [C: 04-1] "looks good in principle, but can be simplified" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [16:14:03] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1123 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484697 [16:15:01] (03CR) 10Gehel: [C: 04-1] "see comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484685 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [16:15:29] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/483798 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [16:21:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Comments inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484579 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [16:22:37] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 43.57 seconds [16:22:37] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 43.64 seconds [16:22:41] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 29.43 seconds [16:22:45] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 16.95 seconds [16:22:53] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 4.51 seconds [16:22:59] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [16:23:01] RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 0.15 seconds [16:23:11] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 0.39 seconds [16:23:41] RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 0.28 seconds [16:24:45] 10Operations, 10Gerrit: Convert Gerrit to use H2 as the database after 2.16 upgrade - https://phabricator.wikimedia.org/T211139 (10Paladox) Reading notes from https://gitenterprise.me/2019/01/16/migrating-from-gerrit-2-15-to-2-16/ To convert 
you setup a vanilla gerrit site (weather it be in a separate directo... [16:25:55] (03CR) 10Mark Bergsma: [C: 03+2] Expand Coordinator.resultUp behavior on first monitor check result [debs/pybal] - 10https://gerrit.wikimedia.org/r/478203 (owner: 10Mark Bergsma) [16:25:57] (03PS3) 10Smalyshev: Move categories namespace to second instance [puppet] - 10https://gerrit.wikimedia.org/r/484344 (https://phabricator.wikimedia.org/T213212) [16:26:54] (03Merged) 10jenkins-bot: Expand Coordinator.resultUp behavior on first monitor check result [debs/pybal] - 10https://gerrit.wikimedia.org/r/478203 (owner: 10Mark Bergsma) [16:35:11] jouncebot: next [16:35:11] In 0 hour(s) and 24 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1700) [16:38:28] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Depool db1123 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484697 (owner: 10Jcrespo) [16:39:32] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1123 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484697 (owner: 10Jcrespo) [16:41:46] (03PS4) 10Arturo Borrero Gonzalez: toolforge: refactor docker registry profile [puppet] - 10https://gerrit.wikimedia.org/r/483765 (https://phabricator.wikimedia.org/T213418) [16:42:00] (03PS2) 10Alexandros Kosiaris: Remove externalIP settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/484670 [16:42:51] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.80 seconds [16:43:06] (03PS5) 10Arturo Borrero Gonzalez: toolforge: refactor docker registry profile [puppet] - 10https://gerrit.wikimedia.org/r/483765 (https://phabricator.wikimedia.org/T213418) [16:45:35] !log upgrading NIC firmware on cp1075 - T203194 [16:45:37] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1123 (duration: 00m 52s) [16:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:38] T203194: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 [16:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:35] !log gehel@deploy1001 Started deploy [wdqs/wdqs@6685dc0]: multi instance fixes [16:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: refactor docker registry profile [puppet] - 10https://gerrit.wikimedia.org/r/483765 (https://phabricator.wikimedia.org/T213418) (owner: 10Arturo Borrero Gonzalez) [16:48:53] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.79 seconds [16:49:03] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.40 seconds [16:49:07] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.55 seconds [16:49:09] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.26 seconds [16:49:19] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.11 seconds [16:49:39] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.84 seconds [16:49:55] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.70 seconds [16:50:22] 10Operations, 10Core Platform Team, 10Performance-Team, 10serviceops, 10User-Joe: Set up a beta feature 
offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10akosiaris) Adding performance-team and core platform team per SoS recommendation to request for help. [16:50:42] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Milimetric) > Which major design goal would that be? /me genuinely interested The next paragraph in T211... [16:53:31] PROBLEM - IPsec on cp4029 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1075_v4, cp1075_v6 [16:53:33] PROBLEM - IPsec on cp5008 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp1075_v4, cp1075_v6 [16:53:35] PROBLEM - IPsec on cp5012 is CRITICAL: Strongswan CRITICAL - ok: 35 not-conn: cp1075_v6 [16:53:39] !log stop upgrade and restart db1112 [16:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:45] RECOVERY - IPsec on cp4029 is OK: Strongswan OK - 36 ESP OK [16:54:45] RECOVERY - IPsec on cp5008 is OK: Strongswan OK - 36 ESP OK [16:54:49] RECOVERY - IPsec on cp5012 is OK: Strongswan OK - 36 ESP OK [16:56:00] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1123 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484697 (owner: 10Jcrespo) [16:58:04] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Security-Team, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q3): [2 hrs] Decide on handling system updates for Proton - https://phabricator.wikimedia.org/T213366 (10ovasileva) [16:58:04] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@6685dc0]: multi instance fixes (duration: 10m 29s) [16:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1700). [17:00:04] Zoranzoki21, kostajh, and dcausse: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:56] Hey [17:00:59] Here? [17:01:01] o/ [17:01:17] !log gehel@deploy1001 Started deploy [wdqs/wdqs@6685dc0]: multi instance fixes [17:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:45] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@6685dc0]: multi instance fixes (duration: 00m 27s) [17:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:53] Here [17:02:07] Who will SWAT [17:02:52] (I will move my patches for next) [17:03:05] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) Great meeting! We realized that the same problem we are encountering with the available Cloud VPS infrastructu... [17:03:15] I can SWAT [17:04:07] he's gone :/ [17:04:10] kostajh: around? 
[17:04:36] dcausse: I'm here [17:05:32] !log upgrading NIC firmware in cp1076 - T203194 [17:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:35] T203194: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 [17:06:13] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484289 (https://phabricator.wikimedia.org/T213348) (owner: 10Kosta Harlan) [17:06:30] (03CR) 10Mobrovac: services: add missing 'mediawiki/services' prefix to git cloning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [17:07:15] !log anomie@deploy1001 Synchronized php-1.33.0-wmf.12/includes/page/WikiPage.php: Add temporary logging for T210739 (duration: 00m 53s) [17:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:20] T210739: Target deletion during page move fails - https://phabricator.wikimedia.org/T210739 [17:07:23] (03Merged) 10jenkins-bot: EditorJourney: Enable data collection for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484289 (https://phabricator.wikimedia.org/T213348) (owner: 10Kosta Harlan) [17:08:34] kostajh: it's live on mwdebug1002, can you test there? [17:08:48] 10Operations, 10Core Platform Team, 10Performance-Team, 10serviceops, 10User-Joe: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Reedy) Do we want MW to tag edits etc like we did for HHVM? [17:08:58] dcausse: yes, just a few minutes please [17:09:03] sure [17:09:56] (03CR) 10jenkins-bot: EditorJourney: Enable data collection for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484289 (https://phabricator.wikimedia.org/T213348) (owner: 10Kosta Harlan) [17:10:39] (03CR) 10Gehel: [C: 03+2] elasticsearch_cluster: change is_green() implementation [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 (owner: 10Mathew.onipe) [17:10:41] 10Operations, 10Core Platform Team, 10Performance-Team, 10serviceops, 10User-Joe: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Joe) >>! In T213934#4885135, @Reedy wrote: > Do we want MW to tag edits etc like we did for HHVM? I would think so, yes. [17:11:41] (03CR) 10jenkins-bot: elasticsearch_cluster: change is_green() implementation [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 (owner: 10Mathew.onipe) [17:12:15] dcausse: my account creation request timed out, so trying again (needed in order to verify) [17:12:37] kostajh: oh yes mwdebug1002 is a bit annoying :/ [17:12:51] ? [17:12:55] how come? [17:12:57] got it to work on second try, waiting for someone else on my team to verify [17:13:07] it is lacking resources? 
we can add more if so [17:13:07] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6 [17:13:07] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6 [17:13:07] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6 [17:13:07] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6 [17:13:11] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6 [17:13:17] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6 [17:13:17] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6 [17:13:21] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6 [17:13:23] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6 [17:13:25] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6 [17:13:25] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6 [17:13:27] akosiaris: after scap it seems to struggle a lot [17:13:27] PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6 [17:13:27] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6 [17:13:27] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6 [17:13:27] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6 [17:13:27] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6 [17:13:28] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6 [17:13:28] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6 [17:13:35] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6 [17:13:35] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6 [17:13:37] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6 [17:13:39] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6 [17:13:41] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1076_v4, cp1076_v6 [17:13:49] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1076_v4, cp1076_v6 [17:13:51] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 63 not-conn: cp1076_v6 [17:13:53] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 39 not-conn: cp1076_v6 [17:13:53] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 39 not-conn: cp1076_v6 [17:13:53] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 39 not-conn: cp1076_v6 [17:13:53] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 39 not-conn: cp1076_v6 [17:13:55] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 63 not-conn: cp1076_v6 [17:14:19] dcausse: ah yes it's evident from https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?orgId=1&var-server=mwdebug1002&var-datasource=eqiad%20prometheus%2Fops&from=now-1h&to=now-1m [17:14:22] ^ eh, SAL says the 
firmware was upgraded just recently [17:14:23] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 40 ESP OK [17:14:23] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 40 ESP OK [17:14:23] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 40 ESP OK [17:14:23] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 40 ESP OK [17:14:25] vgutierrez: ^ [17:14:27] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 40 ESP OK [17:14:33] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 40 ESP OK [17:14:33] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 40 ESP OK [17:14:37] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK [17:14:39] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 40 ESP OK [17:14:41] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 40 ESP OK [17:14:41] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 40 ESP OK [17:14:43] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 40 ESP OK [17:14:43] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 40 ESP OK [17:14:43] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 40 ESP OK [17:14:43] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 40 ESP OK [17:14:43] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 40 ESP OK [17:14:43] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 40 ESP OK [17:14:44] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 40 ESP OK [17:14:49] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK [17:14:51] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 64 ESP OK [17:14:51] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK [17:14:53] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 64 ESP OK [17:14:55] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 40 ESP OK [17:14:56] 10Operations, 10Core Platform Team, 10Performance-Team, 10serviceops, 10User-Joe: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Jdforrester-WMF) Happy to help with this still, per IRC. :-) [17:15:03] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK [17:15:05] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 64 ESP OK [17:15:09] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 40 ESP OK [17:15:09] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 40 ESP OK [17:15:09] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 40 ESP OK [17:15:09] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 40 ESP OK [17:15:09] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 64 ESP OK [17:15:24] dcausse: all looks good [17:15:26] dcausse: let me know when swat is done. I can double (or even triple) the vCPUs, but I 'll need a reboot [17:15:50] akosiaris: sure, thanks for looking into it, I'll let you know [17:15:58] kostajh: ok deploying [17:16:41] dcausse: thank you! 
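One way the mwdebug1002 verification above could be reproduced from a shell before the config is synced everywhere (the X-Wikimedia-Debug header value format is an assumption, and the URL is only an example matching the viwiki change being tested):

    curl -sI -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' \
        'https://vi.wikipedia.org/wiki/Special:CreateAccount' | head -n 5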
[17:18:16] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EditorJourney: Enable data collection for viwiki T213348 (duration: 00m 52s) [17:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:19] T213348: Understanding first day: activate for Vietnamese Wikipedia - https://phabricator.wikimedia.org/T213348 [17:18:40] kostajh: yw, (and it's live btw) [17:19:21] <_joe_> yeah we need to give that machine a bit more cpu kostajh [17:19:24] <_joe_> there is a ticket already [17:19:30] <_joe_> sorry I never got to it [17:19:45] (03PS2) 10DCausse: [cirrus] Start using replica group settings (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484693 (https://phabricator.wikimedia.org/T210381) [17:19:47] (03PS27) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [17:20:39] mutante: hmm it looks like the FW upgrade is too slow and the IPsec check is triggered [17:21:00] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.39 seconds [17:21:26] vgutierrez: when i pinged i assumed it broke again.. only then realized you were in the middle of it. gotcha [17:21:38] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484693 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [17:21:49] PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.42 seconds [17:21:59] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.39 seconds [17:22:15] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.37 seconds [17:22:17] !log rolling NIC firmware upgrade cp[1077-1080] - T203194 [17:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:20] T203194: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 [17:22:35] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.32 seconds [17:22:43] (03Merged) 10jenkins-bot: [cirrus] Start using replica group settings (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484693 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [17:23:08] (03CR) 10jenkins-bot: [cirrus] Start using replica group settings (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484693 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [17:25:59] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 372.67 seconds [17:27:31] 10Operations, 10Core Platform Team, 10Performance-Team, 10serviceops, and 2 others: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Reedy) ^ Most of it done by reverting Ori's patch to remove the HHVM beta feature and then updating to match [17:29:21] 10Operations, 10DBA, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10jcrespo) switchover script works as expected (tested on db1111/db1112): `lang=sh, lines=10 ./switchover.py --skip-slave-move db1111 db1112 Starting preflight checks... * Original rea... 
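The "!log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php ..." entry above is the kind of message scap records when a config SWAT change is synced from the deployment host. A minimal sketch of that step, assuming the usual staging checkout path and that scap sync-file is the appropriate subcommand (neither is stated in this log):

```
# Sketch of a wmf-config SWAT sync; the staging path and exact scap invocation are assumptions.
ssh deploy1001
cd /srv/mediawiki-staging                                    # assumed checkout of operations/mediawiki-config
git log --oneline -1 wmf-config/InitialiseSettings.php      # confirm the merged change is present locally
scap sync-file wmf-config/InitialiseSettings.php 'EditorJourney: Enable data collection for viwiki T213348'
```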
[17:29:37] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 434.71 seconds [17:30:40] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10EvanProdromou) So, we use caching in MediaWiki for a... [17:31:25] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 465.08 seconds [17:31:27] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 465.20 seconds [17:35:13] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [cirrus] Start using replica group settings (take 2) (T210381) (duration: 00m 51s) [17:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:16] T210381: Update mw-config to use the psi&omega elastic clusters - https://phabricator.wikimedia.org/T210381 [17:35:59] 10Operations, 10DBA, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) Awesome news! We have to include it on the steps list on our etherpad, which I wrote yesterday evening and needs to be reviewed by you, as it was late in the day, so error... [17:36:53] !log dcausse@deploy1001 Synchronized wmf-config/CirrusSearch-production.php: [cirrus] Start using replica group settings (take 2) (T210381) (duration: 00m 51s) [17:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:32] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10Joe) >>! In T212129#4885223, @EvanProdromou wrote: >... 
[17:39:05] (03PS2) 10Andrew Bogott: proxyleaks.py: update for multi-region and other issues [puppet] - 10https://gerrit.wikimedia.org/r/484303 [17:39:07] (03PS1) 10Andrew Bogott: SGE: move exec-manage script from bastions to grid masters [puppet] - 10https://gerrit.wikimedia.org/r/484713 [17:39:55] (03PS2) 10Effie Mouzeli: role::eqiad::scb: Switch rdb1006 to redis::misc::master [puppet] - 10https://gerrit.wikimedia.org/r/484572 (https://phabricator.wikimedia.org/T213859) [17:40:04] (03CR) 10Andrew Bogott: [C: 03+2] proxyleaks.py: update for multi-region and other issues [puppet] - 10https://gerrit.wikimedia.org/r/484303 (owner: 10Andrew Bogott) [17:40:28] (03CR) 10Andrew Bogott: [C: 03+2] SGE: move exec-manage script from bastions to grid masters [puppet] - 10https://gerrit.wikimedia.org/r/484713 (owner: 10Andrew Bogott) [17:42:55] PROBLEM - Backup of s2 in codfw on db1115 is CRITICAL: Backup for s2 at codfw taken more than 8 days ago: Most recent backup 2019-01-08 17:39:30 [17:53:13] !log stop upgrade and restart db1111 [17:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:09] (03PS1) 10Andrew Bogott: sge: second attempt to get exec-manage installed on the master [puppet] - 10https://gerrit.wikimedia.org/r/484715 [17:54:10] jouncebot: next [17:54:10] In 0 hour(s) and 5 minute(s): Wikidata WikibaseQualityConstraints Job deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1800) [17:55:17] (03CR) 10Andrew Bogott: [C: 03+2] sge: second attempt to get exec-manage installed on the master [puppet] - 10https://gerrit.wikimedia.org/r/484715 (owner: 10Andrew Bogott) [17:55:23] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@0aa107a]: Re-deploy for fixing vars.sh [17:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:55] (03PS1) 10Addshore: ConstraintsCheckJobs enabled on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484719 (https://phabricator.wikimedia.org/T204031) [17:59:10] !next [18:00:04] addshore: Your horoscope predicts another unfortunate Wikidata WikibaseQualityConstraints Job deployment deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1800). [18:00:04] addshore: A patch you scheduled for Wikidata WikibaseQualityConstraints Job deployment is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:07] (03CR) 10Addshore: [C: 03+2] ConstraintsCheckJobs enabled on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484719 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [18:01:41] (03Merged) 10jenkins-bot: ConstraintsCheckJobs enabled on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484719 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [18:01:57] (03CR) 10jenkins-bot: ConstraintsCheckJobs enabled on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484719 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [18:02:52] (03PS28) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [18:02:54] (03PS1) 10DCausse: [cirrus] Enable CirrusSearchCrossClusterSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484720 (https://phabricator.wikimedia.org/T210381) [18:03:11] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: ConstraintsCheckJobs enabled on testwikidatawiki T204031 (duration: 00m 52s) [18:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:14] T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031 [18:04:33] addshore: can I use a bit of your deployment time when you're done? [18:04:43] dcausse: yes [18:05:40] addshore: was that 100% of testwd? [18:05:45] mobrovac: yes [18:05:47] k [18:06:02] but with only the edit rate im trying to create now :D [18:06:15] addshore: thanks, in fact it's not working so I won't deploy anything yet :/ [18:06:25] dcausse: :( [18:06:58] mobrovac: testwikidatawiki should show up on https://grafana.wikimedia.org/d/000000105/job-queue-rate?orgId=1&var-Job=constraintsRunCheck&from=now-15m&to=now right? [18:07:12] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@0aa107a]: Re-deploy for fixing vars.sh (duration: 11m 49s) [18:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:19] addshore: https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus [18:07:24] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) >>! In T213859#4883927, @elukey wrote: > Ok for me for the analytics nodes, but I'd need a bit of heads up to properly stop them if possible :) > > The Thursday... [18:07:38] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) [18:08:28] mobrovac: amazing, so testwikidatawiki is generaitng jobs and they are entering the queue and running? 
:) [18:08:32] if I'm reading that correctly [18:08:46] lemme double check [18:10:17] i see constraintsRunCheck and constraintsTableUpdate but not constraintsCheck [18:10:18] (03PS1) 10Addshore: ConstraintsCheckJobs enabled on testwikidatawiki (1% of edits) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484723 (https://phabricator.wikimedia.org/T204031) [18:10:26] mobrovac: its called constraintsRunCheck [18:10:27] =] [18:10:34] ah ok [18:10:42] then yes, you are reading this correctly addshore :) [18:10:54] I'll go ahead with 1% of wikidata edits then, and that is where we will leave it today :)_ [18:10:57] (03CR) 10Addshore: [C: 03+2] ConstraintsCheckJobs enabled on testwikidatawiki (1% of edits) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484723 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [18:11:31] kk addshore, sounds good [18:12:02] (03Merged) 10jenkins-bot: ConstraintsCheckJobs enabled on testwikidatawiki (1% of edits) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484723 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [18:12:10] akosiaris: forgot to tell you that I'm done with SWAT (sorry got distracted with things not working) [18:12:18] addshore: for tomorrow, both Petr and I are in PST, so it would be good to continue no earlier than 17:00 UTC [18:12:29] mobrovac: ack! [18:12:53] mobrovac: the plan is just to do one of the increases per day, so tomorrow (if I do it tomorrow) would only be 1% to 5% [18:13:07] kk sounds good addshore [18:13:41] mobrovac: does each queue have some sort of throughput limits? [18:13:56] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: ConstraintsCheckJobs enabled on wikidatawiki (1% of edits) T204031 (duration: 00m 51s) [18:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:01] T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031 [18:14:03] I dont really know many of the nitty gritty details of the magical job queue ;) [18:14:12] yup addshore, we set it not to overwhelm the jobrunners and the DB [18:14:22] mobrovac: great, whats the default? [18:14:36] (03CR) 10jenkins-bot: ConstraintsCheckJobs enabled on testwikidatawiki (1% of edits) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484723 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [18:15:16] addshore: https://github.com/wikimedia/mediawiki-services-change-propagation-jobqueue-deploy/blob/master/scap/vars.yaml#L101 [18:15:17] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10akosiaris) >>! In T211881#4882359, @MSantos wrote: >>>! In T211881#4878426, @akosiaris wrote: >>> @akosia... [18:15:22] mobrovac: thanks! [18:15:25] addshore: 50 concurrent execs [18:15:33] addshore: what's the volume you are expecting? 
[18:15:50] rate actually, rather than volume [18:16:23] well, rate of edits is 400-1000 epm, there will be some amount of deduplication there too, [18:16:44] !log deploy slot done [18:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:49] i see [18:17:01] kk, we'll likely need a special rule for this one then [18:17:01] mobrovac: I can do some more maths later today :) [18:17:15] mobrovac: the run times of the jobs can also vary greatly [18:18:02] addshore: ok, then we'll definitely need a special rule for this job as i foresee some fine-tuning :) [18:18:25] mobrovac: yep! I'm looking forward to it [18:21:01] ACKNOWLEDGEMENT - Juniper alarms on asw-a-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms Ayounsi https://phabricator.wikimedia.org/T213859 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [18:21:01] ACKNOWLEDGEMENT - Juniper alarms on asw2-a-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms Ayounsi https://phabricator.wikimedia.org/T213859 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [18:24:49] (03CR) 10Alexandros Kosiaris: [C: 03+1] "True but there's a catch. We won't have another meeting to approve such requests for another 2 weeks. I 'll escalate to mark and faidon fo" [puppet] - 10https://gerrit.wikimedia.org/r/483800 (https://phabricator.wikimedia.org/T213563) (owner: 10Thcipriani) [18:33:33] (03PS8) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) [18:33:44] (03CR) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [18:48:06] (03CR) 10Mobrovac: [C: 03+1] "PCC - https://puppet-compiler.wmflabs.org/compiler1002/14353/scandium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [18:55:59] !log upgraded jenkins version for jessie and stretch in apt.wikimedia.org to latest LTS [18:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T1900) [19:01:15] (03CR) 10Faidon Liambotis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/483800 (https://phabricator.wikimedia.org/T213563) (owner: 10Thcipriani) [19:02:15] im here (for the gerrit upgrade) though will also be watching a vote in the uk. [19:04:20] !log starting gerrit upgrade to 2.15.8 [19:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:04] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (10ayounsi) Diff for option 3 in eqiad is: `lang=diff [edit interfaces ae1 unit 1017 family inet] + filter { + output private-out4; + } [edit interface... 
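A rough back-of-the-envelope check of the numbers discussed above (default of 50 concurrent executions, 400-1000 edits per minute): the average job runtime used below is an assumption for illustration only, not a figure from this conversation.

```
# Little's law: needed concurrency = arrival rate * average runtime.
# Assuming ~3 s per constraintsRunCheck job (illustrative; not from the log):
python3 -c "epm = 1000; sampling = 1.0; runtime_s = 3.0; print(epm / 60.0 * sampling * runtime_s)"
# -> 50.0 at full rollout, i.e. right at the default cap of 50 concurrent execs,
#    which is why a dedicated, fine-tuned rule was anticipated; at 1% sampling it is only ~0.5.
```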
[19:09:26] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@cec7995]: Gerrit to 2.15.8 on gerrit2001 only [19:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:38] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@cec7995]: Gerrit to 2.15.8 on gerrit2001 only (duration: 00m 11s) [19:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:37] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@cec7995]: Gerrit to 2.15.8 on cobalt [19:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:47] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@cec7995]: Gerrit to 2.15.8 on cobalt (duration: 00m 10s) [19:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:49] !log restarting gerrit on cobalt for 2.15.8 upgrade [19:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:43] (03PS9) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) [19:17:57] addshore Reedy https://phabricator.wikimedia.org/T213915#4885559 https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/484734/ [19:20:43] I think the patch provided should smooth the upcoming train and remaining deploys this week? [19:20:52] marxarelli: hi!! ^ [19:21:06] thx again 4 ur patience [19:23:25] anyone getting 503s from gerrit recently? oh wait nevermind that was the deploy i bet :D [19:23:42] brion: upgrade happened a few min ago but up [19:23:43] \o/ [19:23:47] yep looks good now [19:23:50] AndyRussG: ok. was that already swat deployed to groups on wmf.12? [19:23:50] cool [19:26:16] AndyRussG: i'm rolling wmf.13 this week which, if i'm reading that task correctly, will incorporate the CN you want deployed [19:27:30] marxarelli: mmm noo, lemme explain [19:27:41] so CentralNotice is a special snowflake for deploys [19:28:03] basically the submodule just points to the head of the wmf_deploy branch, always [19:28:13] which we update periodically [19:28:49] (this should change, btw, see https://phabricator.wikimedia.org/T136904 ) [19:29:09] !log restarting ci jenkins for upgrade [19:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:21] So this week we did an update... however the intention was to update for .13, not .12 [19:29:45] since it also affected .12, I made the above patch, to keep .12 at the currently deployed version of CN [19:30:30] I guess it might be more relevant for SWAT deploys then, since the train deploy will only push out .13, come to think of it [19:30:45] I dunno if train deploys ever do anything with the old branch [19:31:24] Here is the task that I made that patch in response to: https://phabricator.wikimedia.org/T213915 [19:31:40] (03PS6) 10Thcipriani: Beta: Clean puppetmaster cherry-picks [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) [19:32:12] (03CR) 10jerkins-bot: [V: 04-1] Beta: Clean puppetmaster cherry-picks [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) (owner: 10Thcipriani) [19:32:37] Anyway, fwiw, wmf.12 should have CN at 63b5490 (which is what's currently deployed) and wmf.13 should be at 445deea40 (also what's currently deployed to wikis that are on wmf.13) [19:34:33] Reedy yt? 
[19:35:36] > You just need to update the git sub module for CN on deploy1001 and sync it :) [19:35:38] With or without the patch (https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/484734/) ? [19:36:08] Though I have deploy rights it's been ages since I did an actual deploy myself, and in truth it kinda terrifies me [19:36:39] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 234.44 seconds [19:43:41] (03CR) 10Dzahn: [C: 03+2] "merging because it's just a revert of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/410072/" [puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) (owner: 10MarcoAurelio) [19:43:42] hi, could someone open https://phabricator.wikimedia.org/T210785 please? [19:44:01] paladox: i can [19:44:07] thanks! [19:44:09] what do you need [19:44:13] ^ ? [19:44:15] just to know if i can ? [19:44:18] to change visibility to public. [19:44:28] since it was deployed. [19:44:29] ah, cool, thought we weren't done somehow :) [19:44:31] ah [19:44:33] heh [19:45:39] thcipriani: so you agree to make public? [19:45:49] +1 [19:46:06] done [19:46:38] thanks :) [19:46:40] AndyRussG: right, so the "you just need to update the submodule for CN and sync it" is implying a swat deploy [19:47:14] (03PS4) 10Gehel: Move categories namespace to second instance [puppet] - 10https://gerrit.wikimedia.org/r/484344 (https://phabricator.wikimedia.org/T213212) (owner: 10Smalyshev) [19:47:53] trains roll the next versioned branch for core out, but for cherry-picks and backports for already deployed versioned branches need to be done via swat [19:48:12] !log switching wdqs categories traffic to new second instance, puppet will be disabled during the operation on all wdqs nodes - T213212 [19:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:15] T213212: Move category namespace to a separate blazegraph instance - https://phabricator.wikimedia.org/T213212 [19:48:25] (03PS10) 10Dzahn: mediawiki: Stop logging each run of purge_abusefilter.pp [puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) (owner: 10MarcoAurelio) [19:49:30] (03CR) 10Gehel: [C: 03+2] Move categories namespace to second instance [puppet] - 10https://gerrit.wikimedia.org/r/484344 (https://phabricator.wikimedia.org/T213212) (owner: 10Smalyshev) [19:49:51] AndyRussG: hope that makes sense. if you want help scheduling a swat for that, you can hit someone up from https://wikitech.wikimedia.org/wiki/SWAT_deploys#The_team in -releng [19:52:00] marxarelli: yeee gotcha, thx!! [19:52:13] (03PS11) 10Dzahn: mediawiki: Stop logging each run of purge_abusefilter.pp [puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) (owner: 10MarcoAurelio) [19:52:51] (03PS1) 10ArielGlenn: do multistream dumps in parallel and recombine for big wikis [dumps] - 10https://gerrit.wikimedia.org/r/484754 (https://phabricator.wikimedia.org/T213912) [19:53:08] AndyRussG: np! [19:55:23] PROBLEM - WDQS SPARQL on wdqs1010 is CRITICAL: connect to address 10.64.32.63 and port 80: Connection refused [19:55:31] (03PS1) 10Gehel: wdqs: fix typo in nginx config file [puppet] - 10https://gerrit.wikimedia.org/r/484755 (https://phabricator.wikimedia.org/T213212) [19:55:37] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
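To make the "update the git sub module for CN on deploy1001 and sync it" step concrete, here is a minimal sketch under stated assumptions: the staging path, the choice of scap subcommand (sync-dir vs sync-file differs between scap versions), and the log message are guesses; the commit 63b5490 is the wmf.12 pin mentioned above.

```
# Sketch only; verify paths and the scap subcommand before running.
cd /srv/mediawiki-staging/php-1.33.0-wmf.12/extensions/CentralNotice   # assumed staging path
git fetch origin
git log --oneline -1 origin/wmf_deploy     # CentralNotice deployments track the head of wmf_deploy
git checkout 63b5490                       # keep wmf.12 at the currently deployed CN commit (T213915)
cd /srv/mediawiki-staging
scap sync-dir php-1.33.0-wmf.12/extensions/CentralNotice 'Keep CentralNotice at deployed commit on wmf.12 (T213915)'
```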
[19:55:47] PROBLEM - WDQS HTTP on wdqs1010 is CRITICAL: connect to address 10.64.32.63 and port 80: Connection refused [19:56:09] PROBLEM - WDQS HTTP Port on wdqs1010 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused [19:56:19] ^ that's me, it's a test server and the fix is coming up [19:56:44] (03CR) 10Gehel: [C: 03+2] wdqs: fix typo in nginx config file [puppet] - 10https://gerrit.wikimedia.org/r/484755 (https://phabricator.wikimedia.org/T213212) (owner: 10Gehel) [19:56:52] (03PS2) 10Gehel: wdqs: fix typo in nginx config file [puppet] - 10https://gerrit.wikimedia.org/r/484755 (https://phabricator.wikimedia.org/T213212) [19:57:11] saw it, thanks [19:59:01] RECOVERY - WDQS SPARQL on wdqs1010 is OK: HTTP OK: HTTP/1.1 200 OK - 17141 bytes in 0.001 second response time [19:59:15] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational [19:59:25] RECOVERY - WDQS HTTP on wdqs1010 is OK: HTTP OK: HTTP/1.1 200 OK - 17141 bytes in 0.001 second response time [19:59:49] RECOVERY - WDQS HTTP Port on wdqs1010 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.017 second response time [20:00:03] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10bd808) >>! In T207321#4885112, @Ottomata wrote: > A. Bryan noted that we won't want to use users regular LDAP accounts fo... [20:00:04] marxarelli: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Americas version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T2000). [20:05:42] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) > I was trying to figure out how this will actually look for the end user It'd be in the JDBC connection, e.g... [20:12:06] (03CR) 10Dzahn: "eh.. got the duplicate deploy now. 
File[/srv/deployment/parsoid/deploy/deploy]" [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [20:14:24] (03PS1) 10Dduvall: group1 wikis to 1.33.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484759 [20:14:26] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.33.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484759 (owner: 10Dduvall) [20:15:47] RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 285.58 seconds [20:16:01] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484759 (owner: 10Dduvall) [20:19:05] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.13 [20:19:57] !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.13 (duration: 00m 51s) [20:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:19] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484759 (owner: 10Dduvall) [20:30:27] PROBLEM - Nginx local proxy to apache on mw1246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:41] PROBLEM - Apache HTTP on mw1246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:51] PROBLEM - HHVM rendering on mw1246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:45] RECOVERY - Nginx local proxy to apache on mw1246 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.036 second response time [20:33:01] RECOVERY - Apache HTTP on mw1246 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.036 second response time [20:33:11] RECOVERY - HHVM rendering on mw1246 is OK: HTTP OK: HTTP/1.1 200 OK - 75751 bytes in 0.096 second response time [20:33:43] 10Puppet, 10AbuseFilter, 10Patch-For-Review, 10User-MarcoAurelio: Check if it is safe to disable logging for purge_abusefilter.pp cron job - https://phabricator.wikimedia.org/T213591 (10MarcoAurelio) 05Open→03Resolved a:03MarcoAurelio Nothing left to do here. 
[20:33:57] 10Puppet, 10AbuseFilter, 10User-MarcoAurelio: Check if it is safe to disable logging for purge_abusefilter.pp cron job - https://phabricator.wikimedia.org/T213591 (10MarcoAurelio) [20:36:25] PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.70 seconds [20:38:21] PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.68 seconds [20:38:41] PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.26 seconds [20:38:57] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.27 seconds [20:39:23] PROBLEM - MariaDB Slave Lag: s5 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.71 seconds [20:39:27] PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.71 seconds [20:39:27] PROBLEM - MariaDB Slave Lag: s5 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.48 seconds [20:39:37] PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.49 seconds [20:39:41] PROBLEM - MariaDB Slave Lag: s5 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.02 seconds [20:41:37] (03CR) 10DCausse: [C: 04-2] "https://github.com/elastic/elasticsearch/issues/26833" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484720 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [20:57:11] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 373 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [20:59:09] thanks for the merge mutante [21:00:05] cscott, arlolra, subbu, bearND, halfak, and Amir1: (Dis)respected human, time to deploy Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190116T2100). Please do the needful. 
[21:02:23] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 22 probes of 373 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [21:05:42] (03PS1) 10Smalyshev: Remove $port handling - it's replaced with $endpoint now [puppet] - 10https://gerrit.wikimedia.org/r/484764 [21:05:57] (03PS1) 10Bstorm: Revert "SGE: move exec-manage script from bastions to grid masters" [puppet] - 10https://gerrit.wikimedia.org/r/484765 [21:06:34] (03CR) 10jerkins-bot: [V: 04-1] Revert "SGE: move exec-manage script from bastions to grid masters" [puppet] - 10https://gerrit.wikimedia.org/r/484765 (owner: 10Bstorm) [21:08:00] (03PS1) 10Bstorm: toolforge: change exec-manage to not try submissions and set up on masters [puppet] - 10https://gerrit.wikimedia.org/r/484769 (https://phabricator.wikimedia.org/T213951) [21:09:10] (03PS2) 10Bstorm: Revert "SGE: move exec-manage script from bastions to grid masters" [puppet] - 10https://gerrit.wikimedia.org/r/484765 [21:09:50] (03CR) 10Andrew Bogott: [C: 03+1] toolforge: change exec-manage to not try submissions and set up on masters [puppet] - 10https://gerrit.wikimedia.org/r/484769 (https://phabricator.wikimedia.org/T213951) (owner: 10Bstorm) [21:10:05] (03CR) 10Andrew Bogott: [C: 03+1] Revert "SGE: move exec-manage script from bastions to grid masters" [puppet] - 10https://gerrit.wikimedia.org/r/484765 (owner: 10Bstorm) [21:10:48] (03CR) 10Bstorm: [C: 03+2] Revert "SGE: move exec-manage script from bastions to grid masters" [puppet] - 10https://gerrit.wikimedia.org/r/484765 (owner: 10Bstorm) [21:10:56] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@6b344ca]: Update mobileapps to 258d76b page summary changes [21:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:00] (03PS3) 10Bstorm: Revert "SGE: move exec-manage script from bastions to grid masters" [puppet] - 10https://gerrit.wikimedia.org/r/484765 [21:12:48] 10Operations: Explicitly mention npm in L3 - https://phabricator.wikimedia.org/T213971 (10Legoktm) [21:13:03] (03CR) 10Gehel: [C: 03+2] Remove $port handling - it's replaced with $endpoint now [puppet] - 10https://gerrit.wikimedia.org/r/484764 (owner: 10Smalyshev) [21:13:12] (03PS2) 10Gehel: Remove $port handling - it's replaced with $endpoint now [puppet] - 10https://gerrit.wikimedia.org/r/484764 (owner: 10Smalyshev) [21:14:09] (03PS2) 10Bstorm: toolforge: change exec-manage to not try submissions and set up on masters [puppet] - 10https://gerrit.wikimedia.org/r/484769 (https://phabricator.wikimedia.org/T213951) [21:15:40] (03PS5) 10Ottomata: Add kafka-single-node chart for local development [deployment-charts] - 10https://gerrit.wikimedia.org/r/484498 (https://phabricator.wikimedia.org/T211247) [21:15:49] (03PS3) 10Bstorm: toolforge: change exec-manage to not try submissions and set up on masters [puppet] - 10https://gerrit.wikimedia.org/r/484769 (https://phabricator.wikimedia.org/T213951) [21:17:08] (03CR) 10Bstorm: [C: 03+2] toolforge: change exec-manage to not try submissions and set up on masters [puppet] - 10https://gerrit.wikimedia.org/r/484769 (https://phabricator.wikimedia.org/T213951) (owner: 10Bstorm) [21:17:26] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@6b344ca]: Update mobileapps to 258d76b page summary changes (duration: 06m 31s) [21:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:44] (03PS6) 10Ottomata: Add kafka-single-node chart for 
local development [deployment-charts] - 10https://gerrit.wikimedia.org/r/484498 (https://phabricator.wikimedia.org/T211247) [21:20:12] (03PS1) 10RobH: adding in R440 single CPU SKU [software] - 10https://gerrit.wikimedia.org/r/484771 [21:22:41] !log bmansurov@deploy1001 Started deploy [recommendation-api/deploy@da83637]: Update to 1a1f824 [21:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:44] (03CR) 10jerkins-bot: [V: 04-1] adding in R440 single CPU SKU [software] - 10https://gerrit.wikimedia.org/r/484771 (owner: 10RobH) [21:26:29] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10bd808) >>! In T207321#4885829, @Ottomata wrote: >> I was trying to figure out how this will actually look for the end use... [21:26:46] (03PS3) 10Cwhite: mediawiki: enable statsd_exporter and add matching rules to appserver [puppet] - 10https://gerrit.wikimedia.org/r/481110 (https://phabricator.wikimedia.org/T205870) [21:28:55] !log bmansurov@deploy1001 Finished deploy [recommendation-api/deploy@da83637]: Update to 1a1f824 (duration: 06m 14s) [21:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:13] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 500 (expecting: 200) [21:29:13] !log ppchelko@deploy1001 Started deploy [recommendation-api/deploy@da83637]: log [21:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:15] !log ppchelko@deploy1001 deploy aborted: log (duration: 00m 02s) [21:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:21] RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 46.45 seconds [21:29:29] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 500 (expecting: 200) [21:29:31] RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 23.35 seconds [21:29:35] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 500 (expecting: 200) [21:29:41] RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 1.21 seconds [21:29:55] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 0.49 seconds [21:30:07] RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [21:30:21] Pchelolo ^ is that you (the scb* pages)? 
[21:30:23] RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 0.34 seconds [21:30:29] RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [21:30:39] jijiki: that's me [21:30:41] sorry [21:30:47] hehe [21:30:54] it's not a problem, it's not really exposed to anyone yet [21:33:33] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 500 (expecting: 200) [21:33:35] !log ppchelko@deploy1001 Started deploy [recommendation-api/deploy@c1b6b32]: Rollback update to 1a1f824 [21:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:36] (03CR) 10Cwhite: "Puppet compiler link per volans' request. 10 resources added." [puppet] - 10https://gerrit.wikimedia.org/r/481110 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [21:34:17] 10Operations, 10ops-eqiad, 10Analytics, 10Product-Analytics: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (10Neil_P._Quinn_WMF) [21:34:23] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [21:34:27] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [21:34:45] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [21:35:05] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) > That works for a java developer, but nobody else (including Quarry). Quarry is a Python app which is why I wa... 
[21:35:21] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [21:35:34] !log ppchelko@deploy1001 Finished deploy [recommendation-api/deploy@c1b6b32]: Rollback update to 1a1f824 (duration: 01m 59s) [21:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:57] 10Operations, 10Legalpad: Explicitly mention npm in L3 - https://phabricator.wikimedia.org/T213971 (10Aklapper) [21:38:49] (03PS2) 10ArielGlenn: do multistream dumps in parallel and recombine for big wikis [dumps] - 10https://gerrit.wikimedia.org/r/484754 (https://phabricator.wikimedia.org/T213912) [21:48:40] (03PS1) 10Andrew Bogott: cold-migrate: use the new nova_eqiad1 database [puppet] - 10https://gerrit.wikimedia.org/r/484774 [21:49:04] (03PS2) 10Andrew Bogott: cold-migrate: use the new nova_eqiad1 database [puppet] - 10https://gerrit.wikimedia.org/r/484774 [21:49:43] (03CR) 10Andrew Bogott: [C: 03+2] cold-migrate: use the new nova_eqiad1 database [puppet] - 10https://gerrit.wikimedia.org/r/484774 (owner: 10Andrew Bogott) [21:52:37] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 0.38 seconds [21:54:22] 10Operations, 10Legalpad: Explicitly mention npm in L3 - https://phabricator.wikimedia.org/T213971 (10CDanis) p:05Triage→03Normal [21:55:45] RECOVERY - High lag on wdqs2006 is OK: (C)3600 ge (W)1200 ge 1059 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [21:58:13] (03PS7) 10Ottomata: Add kafka-single-node chart for local development [deployment-charts] - 10https://gerrit.wikimedia.org/r/484498 (https://phabricator.wikimedia.org/T211247) [22:06:33] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review, 10Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249 (10awight) We should probably decline this task in favor of {T182331}. [22:09:04] (03CR) 10CRusnov: [C: 03+1] "Just a slight followup here. Overall LGTM." (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/481914 (owner: 10Volans) [22:13:24] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10ayounsi) [22:15:53] (03PS12) 10Ottomata: [WIP] Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) [22:43:25] (03PS2) 10Dzahn: xhgui: Remove outdated clone of xhprof mirror [puppet] - 10https://gerrit.wikimedia.org/r/484351 (https://phabricator.wikimedia.org/T196406) (owner: 10Krinkle) [22:44:16] (03CR) 10Dzahn: [C: 03+2] xhgui: Remove outdated clone of xhprof mirror [puppet] - 10https://gerrit.wikimedia.org/r/484351 (https://phabricator.wikimedia.org/T196406) (owner: 10Krinkle) [22:44:58] PROBLEM - Backup of s4 in eqiad on db1115 is CRITICAL: Backup for s4 at eqiad taken more than 8 days ago: Most recent backup 2019-01-08 22:35:52 [22:47:00] (03CR) 10Dzahn: [C: 03+2] "do you want me to also rm /srv/xhprof/profiles ?" [puppet] - 10https://gerrit.wikimedia.org/r/484351 (https://phabricator.wikimedia.org/T196406) (owner: 10Krinkle) [22:50:56] (03CR) 10Krinkle: "yes please, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/484351 (https://phabricator.wikimedia.org/T196406) (owner: 10Krinkle) [22:51:08] (03CR) 10Krinkle: "/srv/xhprof/ entirely actually" [puppet] - 10https://gerrit.wikimedia.org/r/484351 (https://phabricator.wikimedia.org/T196406) (owner: 10Krinkle) [22:52:32] (03PS1) 10Jforrester: WBMI: Disable showing 'depicts' statements on Commons for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484781 [22:52:34] (03PS1) 10Jforrester: [Beta Cluster] WBMI: Show 'depicts' statements on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484782 [22:53:32] (03CR) 10jerkins-bot: [V: 04-1] WBMI: Disable showing 'depicts' statements on Commons for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484781 (owner: 10Jforrester) [22:53:36] (03CR) 10jerkins-bot: [V: 04-1] [Beta Cluster] WBMI: Show 'depicts' statements on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484782 (owner: 10Jforrester) [23:04:55] RECOVERY - MariaDB Slave Lag: s6 on db2087 is OK: OK slave_sql_lag Replication lag: 52.55 seconds [23:05:03] RECOVERY - MariaDB Slave Lag: s6 on db2053 is OK: OK slave_sql_lag Replication lag: 52.51 seconds [23:05:09] RECOVERY - MariaDB Slave Lag: s6 on db2039 is OK: OK slave_sql_lag Replication lag: 50.52 seconds [23:05:11] RECOVERY - MariaDB Slave Lag: s6 on db2095 is OK: OK slave_sql_lag Replication lag: 48.47 seconds [23:05:13] RECOVERY - MariaDB Slave Lag: s6 on db2067 is OK: OK slave_sql_lag Replication lag: 47.74 seconds [23:05:17] RECOVERY - MariaDB Slave Lag: s6 on db2060 is OK: OK slave_sql_lag Replication lag: 48.41 seconds [23:05:23] RECOVERY - MariaDB Slave Lag: s6 on db2089 is OK: OK slave_sql_lag Replication lag: 44.13 seconds [23:05:41] RECOVERY - MariaDB Slave Lag: s6 on db2046 is OK: OK slave_sql_lag Replication lag: 40.74 seconds [23:05:41] RECOVERY - MariaDB Slave Lag: s6 on db2076 is OK: OK slave_sql_lag Replication lag: 36.18 seconds [23:08:09] (03PS1) 10RobH: setting test to use python2 [software] - 10https://gerrit.wikimedia.org/r/484783 [23:09:03] (03CR) 10jerkins-bot: [V: 04-1] setting test to use python2 [software] - 10https://gerrit.wikimedia.org/r/484783 (owner: 10RobH) [23:09:15] (03CR) 10Alex Monk: [C: 03+2] Release 0.8 [software/certcentral] - 10https://gerrit.wikimedia.org/r/484511 (https://phabricator.wikimedia.org/T209980) (owner: 10Vgutierrez) [23:10:37] !log krinkle@tungsten:/srv/: rm -rf xhprof; for T196406 [23:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:43] T196406: Decom "xhprof" viewer - https://phabricator.wikimedia.org/T196406 [23:10:56] (03Merged) 10jenkins-bot: Release 0.8 [software/certcentral] - 10https://gerrit.wikimedia.org/r/484511 (https://phabricator.wikimedia.org/T209980) (owner: 10Vgutierrez) [23:11:17] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10Krinkle) [23:12:31] (03CR) 10jenkins-bot: Release 0.8 [software/certcentral] - 10https://gerrit.wikimedia.org/r/484511 (https://phabricator.wikimedia.org/T209980) (owner: 10Vgutierrez) [23:13:24] (03CR) 10RobH: [V: 03+2 C: 03+2] setting test to use python2 [software] - 10https://gerrit.wikimedia.org/r/484783 (owner: 10RobH) [23:13:57] (03PS2) 10RobH: adding in R440 single CPU SKU [software] - 10https://gerrit.wikimedia.org/r/484771 [23:14:48] (03CR) 10jerkins-bot: [V: 04-1] adding in R440 single CPU SKU [software] - 10https://gerrit.wikimedia.org/r/484771 (owner: 
10RobH) [23:16:20] (03PS1) 10RobH: Revert "setting test to use python2" [software] - 10https://gerrit.wikimedia.org/r/484786 [23:16:26] (03CR) 10RobH: [V: 03+2 C: 03+2] Revert "setting test to use python2" [software] - 10https://gerrit.wikimedia.org/r/484786 (owner: 10RobH) [23:21:24] !log ppchelko@deploy1001 Started deploy [recommendation-api/deploy@0ff39e2]: Deployment attempt with decreased worker count [23:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:46] (03PS1) 10RobH: trying to fix CI testing [software] - 10https://gerrit.wikimedia.org/r/484787 [23:22:37] (03CR) 10jerkins-bot: [V: 04-1] trying to fix CI testing [software] - 10https://gerrit.wikimedia.org/r/484787 (owner: 10RobH) [23:23:25] (03CR) 10RobH: [V: 03+2 C: 03+2] "we expected a v-1 due to python setting we are changing via this very patchset" [software] - 10https://gerrit.wikimedia.org/r/484787 (owner: 10RobH) [23:25:04] (03CR) 10Paladox: "see https://github.com/wikimedia/certcentral/blob/8f316e33511707b6d871ad8e15868dfd77e32ff7/tox.ini" (034 comments) [software] - 10https://gerrit.wikimedia.org/r/484787 (owner: 10RobH) [23:25:31] !log ppchelko@deploy1001 Finished deploy [recommendation-api/deploy@0ff39e2]: Deployment attempt with decreased worker count (duration: 04m 08s) [23:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:20] (03PS3) 10RobH: adding in R440 single CPU SKU [software] - 10https://gerrit.wikimedia.org/r/484771 [23:28:02] (03CR) 10jerkins-bot: [V: 04-1] adding in R440 single CPU SKU [software] - 10https://gerrit.wikimedia.org/r/484771 (owner: 10RobH) [23:28:44] (03PS1) 10Paladox: test: do not merge [software] - 10https://gerrit.wikimedia.org/r/484790 [23:28:59] (03PS2) 10Paladox: test: do not merge [software] - 10https://gerrit.wikimedia.org/r/484790 [23:29:45] (03PS1) 10RobH: Revert "trying to fix CI testing" [software] - 10https://gerrit.wikimedia.org/r/484791 [23:29:47] (03PS3) 10Paladox: test: do not merge [software] - 10https://gerrit.wikimedia.org/r/484790 [23:29:50] (03CR) 10jerkins-bot: [V: 04-1] test: do not merge [software] - 10https://gerrit.wikimedia.org/r/484790 (owner: 10Paladox) [23:29:54] (03CR) 10RobH: [V: 03+2 C: 03+2] Revert "trying to fix CI testing" [software] - 10https://gerrit.wikimedia.org/r/484791 (owner: 10RobH) [23:30:36] (03CR) 10jerkins-bot: [V: 04-1] test: do not merge [software] - 10https://gerrit.wikimedia.org/r/484790 (owner: 10Paladox) [23:30:45] (03PS4) 10Paladox: test: do not merge [software] - 10https://gerrit.wikimedia.org/r/484790 [23:30:51] (03CR) 10jerkins-bot: [V: 04-1] test: do not merge [software] - 10https://gerrit.wikimedia.org/r/484790 (owner: 10Paladox) [23:31:11] (03Abandoned) 10Paladox: test: do not merge [software] - 10https://gerrit.wikimedia.org/r/484790 (owner: 10Paladox) [23:31:27] (03PS1) 10Paladox: test do not merge [software] - 10https://gerrit.wikimedia.org/r/484792 [23:31:37] (03PS2) 10Paladox: test do not merge [software] - 10https://gerrit.wikimedia.org/r/484792 [23:32:17] (03CR) 10jerkins-bot: [V: 04-1] test do not merge [software] - 10https://gerrit.wikimedia.org/r/484792 (owner: 10Paladox) [23:32:58] 10Operations, 10netops, 10Patch-For-Review: Detect IP address collisions - https://phabricator.wikimedia.org/T189522 (10ayounsi) a:05ayounsi→03faidon Note that I also created an LibreNMS alert to monitor explicitly the mgmt network: `%syslog.msg ~ "KERN_ARP_ADDR_CHANGE" && %devices.hostname ~ "mr" && %de... 
[23:33:49] (03Abandoned) 10Paladox: test do not merge [software] - 10https://gerrit.wikimedia.org/r/484792 (owner: 10Paladox) [23:37:21] 10Operations, 10netops, 10Patch-For-Review: IGMP snooping breaks IPv6 ND on Junos 14.1X53-D46 - https://phabricator.wikimedia.org/T201039 (10ayounsi) p:05High→03Normal [23:38:03] (03PS10) 10Dzahn: services: add missing 'mediawiki/services' prefix to git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) [23:42:20] (03PS3) 10Dzahn: Add Blubber directory to releases server [puppet] - 10https://gerrit.wikimedia.org/r/483800 (https://phabricator.wikimedia.org/T213563) (owner: 10Thcipriani) [23:42:24] (03PS1) 10Cwhite: role: add prometheus2 rules (new format) [puppet] - 10https://gerrit.wikimedia.org/r/484793 (https://phabricator.wikimedia.org/T213708) [23:43:08] (03CR) 10Dzahn: [C: 03+2] Add Blubber directory to releases server [puppet] - 10https://gerrit.wikimedia.org/r/483800 (https://phabricator.wikimedia.org/T213563) (owner: 10Thcipriani) [23:44:00] (03PS2) 10Cwhite: role: add prometheus2 rules (new format) [puppet] - 10https://gerrit.wikimedia.org/r/484793 (https://phabricator.wikimedia.org/T213708) [23:45:11] (03CR) 10Dzahn: [C: 03+2] "deployed, releases1001 and releases2001 have the new dir, rsync and users" [puppet] - 10https://gerrit.wikimedia.org/r/483800 (https://phabricator.wikimedia.org/T213563) (owner: 10Thcipriani) [23:46:27] 10Operations, 10netops, 10Patch-For-Review: IGMP snooping breaks IPv6 ND on Junos 14.1X53-D46 - https://phabricator.wikimedia.org/T201039 (10ayounsi) Reducing priority as the situation is stable. At this point I don't think the cost of upgrading the switch stacks of row B and C (full row down for ~15min) is... [23:50:13] RECOVERY - MariaDB Slave Lag: s5 on db2075 is OK: OK slave_sql_lag Replication lag: 57.18 seconds [23:50:25] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 54.26 seconds [23:50:29] RECOVERY - MariaDB Slave Lag: s5 on db2066 is OK: OK slave_sql_lag Replication lag: 54.56 seconds [23:50:35] RECOVERY - MariaDB Slave Lag: s5 on db2094 is OK: OK slave_sql_lag Replication lag: 53.65 seconds [23:50:37] RECOVERY - MariaDB Slave Lag: s5 on db2084 is OK: OK slave_sql_lag Replication lag: 52.12 seconds [23:50:41] RECOVERY - MariaDB Slave Lag: s5 on db2089 is OK: OK slave_sql_lag Replication lag: 51.08 seconds [23:50:51] RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 48.91 seconds [23:53:06] (03PS4) 10Huji: Add new synonyms for namespaces in Persian (fa) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484256 (https://phabricator.wikimedia.org/T213733) [23:53:08] 10Operations, 10Traffic, 10netops: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (10ayounsi) [23:53:26] (03PS4) 10Huji: Dissallow eliminators to block certain groups on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476503 (https://phabricator.wikimedia.org/T210642) [23:53:32] 10Operations, 10media-storage: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 (10Platonides) >>! In T213655#4880685, @jcrespo wrote: > Could you try to restore it @Platonides using the wiki admin tools before trying some SQL? There is no entry to restore on the wiki. No link stating th... 
[23:53:48] (03CR) 10jerkins-bot: [V: 04-1] Dissallow eliminators to block certain groups on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476503 (https://phabricator.wikimedia.org/T210642) (owner: 10Huji) [23:54:44] Hi! Is anyone SWATting? addshore hashar aude MaxSem twentyafterfour RoanKattouw Dereckson thcipriani Niharika zeljkof [23:55:25] I don't have any patches specifically to deploy, but there is one to clean up CN submodule prior to deploys to wmf.12 [23:55:48] The SWAT window doesn't start for another 5 minutes, and also it's empty [23:55:55] RoanKattouw: yeah I saw [23:56:06] So there's still time to make it not empty ;) [23:56:24] RoanKattouw: can you take maybe a peek at this to make sure it's sane? https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/484734/ [23:56:28] 10Operations, 10Traffic, 10netops: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (10ayounsi) [23:56:30] 10Operations, 10netops: esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 - https://phabricator.wikimedia.org/T207753 (10ayounsi) [23:57:09] as per https://phabricator.wikimedia.org/T213915 [23:58:32] AndyRussG: Checked it properly, I can confirm it's a revert of https://github.com/wikimedia/mediawiki/commit/251c06deb12a57e0c6891d047fa045697495c5ee as expected :) [23:59:01] Reedy: ok cool thanks!!! Maybe now's as good a time as any to merge it in? [23:59:19] or I guess clean out the deploy server? [23:59:20] Might as well [23:59:29] It's basically creating a noop at this point [23:59:33] yeah [23:59:35] PROBLEM - Backup of s6 in eqiad on db1115 is CRITICAL: Backup for s6 at eqiad taken more than 8 days ago: Most recent backup 2019-01-08 23:34:19 [23:59:39] Yes I just independently concluded it's an undo of that big merge commit [23:59:52] (03CR) 10Mobrovac: services: add missing 'mediawiki/services' prefix to git cloning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484602 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [23:59:55] I don't have the context to understand what's going on, but if Reedy says it makes sense then I believe him [23:59:56] RoanKattouw: yeah the big merge commit was meant just for wmf.13