[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190208T0000). [00:00:04] ebernhardson and AndyRussG: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:39] o/ [00:01:52] \o [00:02:16] i can ship it, two config patches should be easy [00:02:17] (03CR) 10Dzahn: "@bblack is this a bad idea? trying to eliminate that users have to change their local config when the numbers change and we do the same fo" [dns] - 10https://gerrit.wikimedia.org/r/489103 (owner: 10Dzahn) [00:02:45] ebernhardson: ah cool thanks! [00:02:47] (03PS6) 10EBernhardson: Give protect right to centralnoticeadmin on Meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483044 (https://phabricator.wikimedia.org/T209873) (owner: 10AndyRussG) [00:03:44] AndyRussG: you said "Requires confirmation that this is acceptable policy." and accessing the related task gives me an access denied. I'll simply trust you got that confirmation? 
[00:04:01] ebernhardson: yes it's on the task, in a comment [00:04:10] excellent [00:04:11] sorry you should be able to see the task [00:04:24] (03CR) 10EBernhardson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483044 (https://phabricator.wikimedia.org/T209873) (owner: 10AndyRussG) [00:06:20] (03Merged) 10jenkins-bot: Give protect right to centralnoticeadmin on Meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483044 (https://phabricator.wikimedia.org/T209873) (owner: 10AndyRussG) [00:06:55] AndyRussG: pulled to mwdebug1001 [00:07:13] ebernhardson: k one sec [00:07:23] I sillily uninstalled the mw debug browser add on [00:07:28] gotta put that back [00:07:35] btw you should be able to see the task now [00:07:48] i have too many plugins installed...should drop a few. i guess not the mwdebug one :) [00:07:52] (03CR) 10Dzahn: [C: 03+1] "literally "+1 lgtm but somebody else must approve". I filled out the access request section in the Etherpad for the Monday SRE meeting and" [puppet] - 10https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) (owner: 10Mathew.onipe) [00:08:19] (03CR) 10jenkins-bot: Give protect right to centralnoticeadmin on Meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483044 (https://phabricator.wikimedia.org/T209873) (owner: 10AndyRussG) [00:08:24] (03PS2) 10EBernhardson: Turn off wbsearchentities ab test in de, fr, es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488588 (https://phabricator.wikimedia.org/T214515) [00:08:46] (I was seeing a bunch of network activity that I didn't know the source of, so I wanted to see if that was a cause) [00:09:02] (Just because you're paranoid doesn't mean they're not after you) [00:10:10] ebernhardson: lgtm! 
[00:13:04] (03CR) 10EBernhardson: [C: 03+2] Turn off wbsearchentities ab test in de, fr, es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488588 (https://phabricator.wikimedia.org/T214515) (owner: 10EBernhardson) [00:13:13] (03PS1) 10Dzahn: remove bast3003, bast3002 has been repaired [dns] - 10https://gerrit.wikimedia.org/r/489104 (https://phabricator.wikimedia.org/T184936) [00:13:59] (03PS2) 10Dzahn: remove bast3003, bast3002 has been repaired (?) [dns] - 10https://gerrit.wikimedia.org/r/489104 (https://phabricator.wikimedia.org/T184936) [00:14:08] (03Merged) 10jenkins-bot: Turn off wbsearchentities ab test in de, fr, es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488588 (https://phabricator.wikimedia.org/T214515) (owner: 10EBernhardson) [00:14:43] (03CR) 10Dzahn: [C: 04-1] remove bast3003, bast3002 has been repaired (?) [dns] - 10https://gerrit.wikimedia.org/r/489104 (https://phabricator.wikimedia.org/T184936) (owner: 10Dzahn) [00:14:49] AndyRussG: it's almost sync'd ... i think there is still one mw instance that's timing out (was doing it earlier swat today too) [00:14:56] (03CR) 10Dzahn: [C: 04-2] remove bast3003, bast3002 has been repaired (?) 
[dns] - 10https://gerrit.wikimedia.org/r/489104 (https://phabricator.wikimedia.org/T184936) (owner: 10Dzahn) [00:15:17] hmmm okok [00:15:27] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT gerrit:483044 T209873 Give protect right to centralnoticeadmin on Meta (duration: 02m 56s) [00:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:28] there it goes [00:16:47] !log scap sync timed out on mw1299.eqiad.wmnet [00:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:43] ebernhardson: Host is down, I filed a task (it's been rebooted once already today) [00:18:14] (03PS2) 10Dzahn: CNAMEs for bastions in each DC for user convenience [dns] - 10https://gerrit.wikimedia.org/r/489103 [00:18:18] Reedy: gotcha [00:18:18] (03CR) 10BryanDavis: "> Unfortunately, we need PHP7.2 and the gridengine only has 5.5 I" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/488764 (https://phabricator.wikimedia.org/T213669) (owner: 10Samwilson) [00:18:27] (03CR) 10jerkins-bot: [V: 04-1] CNAMEs for bastions in each DC for user convenience [dns] - 10https://gerrit.wikimedia.org/r/489103 (owner: 10Dzahn) [00:20:42] (03CR) 10jenkins-bot: Turn off wbsearchentities ab test in de, fr, es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488588 (https://phabricator.wikimedia.org/T214515) (owner: 10EBernhardson) [00:21:14] ebernhardson: thanks much!! :) [00:21:33] looks good now generally (i.e. 
not just on the debug host) [00:21:49] (03PS3) 10Dzahn: contint: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485096 [00:22:02] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT gerrit:488588 phab:T214515 Turn off wikidata wbsearchentities ab test in de, fr, es (duration: 02m 55s) [00:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:06] T214515: Run wikidata entitiy autocomplete AB test in de, fr, es - https://phabricator.wikimedia.org/T214515 [00:23:24] SWAT complete [00:24:29] yeeee :) [00:26:31] (03CR) 10Dzahn: [C: 03+2] "noop in prod: https://puppet-compiler.wmflabs.org/compiler1002/14578/" [puppet] - 10https://gerrit.wikimedia.org/r/485096 (owner: 10Dzahn) [00:29:49] (03CR) 10Dzahn: [C: 04-1] "talked at allhands and to Mark about it. abandoning / recycling in favor of making notes_url a mandatory parameter for newly added icinga " [puppet] - 10https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [00:37:13] (03CR) 10Dzahn: [C: 03+1] "ok, ready to merge this but what's a good test we should do to make sure no phab mail features are affected" [puppet] - 10https://gerrit.wikimedia.org/r/482400 (https://phabricator.wikimedia.org/T212989) (owner: 10Paladox) [00:43:16] (03PS2) 10Paladox: [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 [00:43:39] (03PS1) 10Ayounsi: Icinga: add ping check for ulsfo PDUs [puppet] - 10https://gerrit.wikimedia.org/r/489113 (https://phabricator.wikimedia.org/T209101) [00:43:49] (03CR) 10jerkins-bot: [V: 04-1] [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (owner: 10Paladox) [00:44:46] (03PS3) 10Paladox: [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 [00:45:24] (03CR) 10jerkins-bot: [V: 04-1] [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (owner: 
10Paladox) [00:47:15] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Backlog): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10thcipriani) >>! In T207707#4909130, @Dzahn wrote: > Let's ask dcops instead and request a new disk to be ad... [00:48:06] jouncebot: next [00:48:06] In 81 hour(s) and 41 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190211T1030) [00:48:26] (03PS4) 10Paladox: [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 [00:49:06] (03CR) 10jerkins-bot: [V: 04-1] [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (owner: 10Paladox) [00:49:23] (03PS5) 10Paladox: [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 [00:50:02] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/14579/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/489113 (https://phabricator.wikimedia.org/T209101) (owner: 10Ayounsi) [00:50:08] (03PS6) 10Paladox: [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 [00:50:42] (03PS7) 10Dzahn: jenkins: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485094 [00:50:59] (03CR) 10jerkins-bot: [V: 04-1] [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (owner: 10Paladox) [00:51:46] (03CR) 10Dzahn: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/485094 (owner: 10Dzahn) [00:54:47] * bd808 sees he will have all weekend to break and fix jouncebot [00:54:51] (03PS10) 10Dzahn: phabricator: Add new cluster.mailers [puppet] - 10https://gerrit.wikimedia.org/r/482400 (https://phabricator.wikimedia.org/T212989) (owner: 10Paladox) [00:55:51] (03PS7) 10Paladox: [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 [00:56:23] (03CR) 
10Dzahn: [C: 03+2] phabricator: Add new cluster.mailers [puppet] - 10https://gerrit.wikimedia.org/r/482400 (https://phabricator.wikimedia.org/T212989) (owner: 10Paladox) [00:56:33] (03CR) 10jerkins-bot: [V: 04-1] [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (owner: 10Paladox) [00:57:37] (03PS8) 10Paladox: [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 [00:57:43] paladox: twentyafterfour: config change deployed (ack, it should not affect anything but let's test) [00:57:50] * paladox tests [00:57:52] eh wait.. that was 2001 [00:58:28] ok, now for real on 1001 [00:58:31] (03CR) 10jerkins-bot: [V: 04-1] [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (owner: 10Paladox) [00:58:42] ok [00:59:07] mutante comment on a task i'm subscribed to [00:59:14] comment can be removed after testing [01:01:17] https://phabricator.wikimedia.org/T200739#4937042 [01:01:30] mutante mail works! [01:01:42] paladox: ok:) thx [01:01:47] you're welcome :) [01:02:00] i was a bad tester, i actually have web-based notifications [01:02:16] heh [01:04:03] paladox: i like that there is a hash tag for 2.16 now, thx https://gerrit.wikimedia.org/r/q/hashtag:%22gerrit-2.16%22+(status:open%20OR%20status:merged) [01:04:11] yup :) [01:07:59] !log powercycle crashed mw1299 via mgmt (garbled console output) (T215569) [01:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:08:04] T215569: mw1299 is down - https://phabricator.wikimedia.org/T215569 [01:09:52] RECOVERY - Host mw1299 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [01:10:52] !log mw1299 has been down about 8 hours, does it need deployment.. 
depooling [01:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:10] Reedy: ^ [01:11:29] so if in the last 8 hours there was deployment, we need to deploy to just mw1299 [01:11:51] mutante: It's died a couple of times today it seems [01:11:51] or just wait until next regular deployment and then remember to repool [01:12:01] So I think it might want a bit more debugging [01:12:06] [mw1299:~] $ depool [01:12:06] Depooling all services on mw1299.eqiad.wmnet [01:12:09] ^ this does not log [01:12:22] Reedy: ok [01:12:42] I did file a task if you want to comment its depooled [01:13:06] 10Operations, 10ops-eqiad: mw1299 is down - https://phabricator.wikimedia.org/T215569 (10Dzahn) ` 20:12 < mutante> [mw1299:~] $ depool 20:12 < mutante> Depooling all services on mw1299.eqiad.wmnet ` [01:13:31] done [01:14:24] ta [01:15:26] PROBLEM - Host mw1280 is DOWN: PING CRITICAL - Packet loss = 100% [01:15:36] 10Operations, 10ops-eqiad: mw1299 is down - https://phabricator.wikimedia.org/T215569 (10Dzahn) it's back up and running right now but depooled because this isn't the first time it happened on this machine [01:16:20] RECOVERY - Host mw1280 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [01:16:22] Reedy: "remove from dsh" still a thing nowadays ... [01:16:31] should i look for it [01:16:34] Is it? I've no idea :D [01:16:46] does scap foo still try and sync to it if it's depooled? :) [01:16:52] hieradata/common/scap/dsh.yaml: - mw1299.eqiad.wmnet [01:16:53] yes [01:17:04] afair [01:17:26] but it would be for deployers not having to skip that host.. 
only [01:18:07] (03CR) 10RobH: [C: 03+1] Icinga: add ping check for ulsfo PDUs [puppet] - 10https://gerrit.wikimedia.org/r/489113 (https://phabricator.wikimedia.org/T209101) (owner: 10Ayounsi) [01:18:16] this is a jobrunner-canary [01:18:32] and it's still called scap::dsh::groups [01:19:35] i think you are not affected since it's not in "mediawiki-installation" [01:19:43] this case wouldn't need it then [01:19:52] PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.72 seconds [01:20:10] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.45 seconds [01:20:12] PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.45 seconds [01:20:17] (03PS9) 10Paladox: [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) [01:20:34] PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 325.15 seconds [01:20:36] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 325.87 seconds [01:20:38] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 326.73 seconds [01:21:13] (03CR) 10jerkins-bot: [V: 04-1] [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) (owner: 10Paladox) [01:23:16] PROBLEM - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 414.32 seconds [01:24:17] (03PS10) 10Paladox: [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) [01:24:23] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) (owner: 10Paladox) [01:25:06] (03CR) 10jerkins-bot: [V: 04-1] [wip] zuul: Convert to using 
scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) (owner: 10Paladox) [01:25:42] (03CR) 10jerkins-bot: [V: 04-1] [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) (owner: 10Paladox) [01:26:41] (03PS11) 10Paladox: [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) [01:26:47] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) (owner: 10Paladox) [01:27:13] (03CR) 10jerkins-bot: [V: 04-1] [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) (owner: 10Paladox) [01:27:49] (03PS12) 10Paladox: [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) [01:27:51] (03CR) 10jerkins-bot: [V: 04-1] [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) (owner: 10Paladox) [01:28:29] (03CR) 10jerkins-bot: [V: 04-1] [wip] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) (owner: 10Paladox) [01:28:44] checked on dbtree only one server in s4 is really affected (db2051) and the other are all back to ok and this has happened sometimes in the past [01:29:41] (03CR) 10Paladox: "@Hashar would you be able to review this please?" 
[puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) (owner: 10Paladox) [01:30:06] (03PS13) 10Paladox: zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) [01:30:33] (during backups) [01:30:48] (03CR) 10jerkins-bot: [V: 04-1] zuul: Convert to using scap [puppet] - 10https://gerrit.wikimedia.org/r/489012 (https://phabricator.wikimedia.org/T215458) (owner: 10Paladox) [01:31:06] RECOVERY - MariaDB Slave Lag: s4 on db2091 is OK: OK slave_sql_lag Replication lag: 0.48 seconds [01:31:08] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 0.52 seconds [01:31:08] RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [01:31:34] RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 0.48 seconds [01:31:36] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [01:31:36] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 0.37 seconds [01:31:40] 10Operations, 10ops-eqiad: mw1299 is down (jobrunner-canary, now up but depooled) - https://phabricator.wikimedia.org/T215569 (10Dzahn) [01:31:54] RECOVERY - MariaDB Slave Lag: s4 on db2084 is OK: OK slave_sql_lag Replication lag: 0.36 seconds [01:32:11] and icinga caught up because it just checks every 5 min [01:37:47] !log dzahn@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1299.eqiad.wmnet [01:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:39:43] Reedy: i also did it the other way using confctl to make sure ^ and that changed the actual state as opposed to running "depool" locally [01:40:21] and the old modules/scap/files/dsh/ is _almost_ gone but it's not a thing anymore [01:42:01] 10Operations, 10ops-eqiad: mw1299 is down (jobrunner-canary, now up but depooled) - 
https://phabricator.wikimedia.org/T215569 (10Dzahn) ` [puppetmaster1001:~] $ sudo -i confctl depool --hostname mw1299.eqiad.wmnet eqiad/jobrunner/apache2/mw1299.eqiad.wmnet: pooled changed yes => no eqiad/jobrunner/nginx/mw129... [01:47:05] 10Operations, 10Mail, 10Phabricator, 10serviceops, and 2 others: Convert Phabricator mail config to use cluster.mailers - https://phabricator.wikimedia.org/T212989 (10Dzahn) deployed in production and we tested mail still works. this just adds the new config and does not remove the old config though [01:50:28] PROBLEM - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 336.37 seconds [01:50:31] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 336.43 seconds [01:50:31] PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 336.91 seconds [01:51:03] PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 350.97 seconds [01:51:45] PROBLEM - Host mw1299 is DOWN: PING CRITICAL - Packet loss = 100% [01:55:01] 10Operations, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Dzahn) p:05High→03Normal lowering priority since Subbu is unblocked and can use the new box and we have switched varnish over. the remaining part is just... 
[01:57:51] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 515.61 seconds [01:57:51] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 515.65 seconds [01:58:09] (03PS1) 10Paladox: phabricator: Remove old mail config [puppet] - 10https://gerrit.wikimedia.org/r/489121 (https://phabricator.wikimedia.org/T212989) [02:00:02] (03PS2) 10Paladox: phabricator: Remove old mail config [puppet] - 10https://gerrit.wikimedia.org/r/489121 (https://phabricator.wikimedia.org/T212989) [02:01:03] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/489121 (https://phabricator.wikimedia.org/T212989) (owner: 10Paladox) [02:06:29] 10Operations, 10hardware-requests: requesting WMF7426 as phabricator system in eqiad - https://phabricator.wikimedia.org/T215335 (10Dzahn) Hi @faidon, let me explain. It was never a request for running 2 phabricator hosts in each datacenter. That's a misunderstanding. It's just that we want to reinstall pha... [02:08:15] 10Operations, 10hardware-requests: requesting WMF7426 as phabricator system in eqiad - https://phabricator.wikimedia.org/T215335 (10Dzahn) a:05Dzahn→03faidon [02:08:37] 10Operations, 10hardware-requests: requesting WMF7426 as phabricator system in eqiad - https://phabricator.wikimedia.org/T215335 (10Dzahn) P.S. 
running it in codfw is blocked on unrelated things (lack of dbproxy) and the host currently called phab1002 with 32GB would immediately go back to pool [02:08:37] PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 790.37 seconds [02:35:55] (03PS1) 10Krinkle: tests: Assert that no computed lists are used in wmf-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489126 [02:36:24] (03PS1) 10Bstorm: toolforge: bastion cgroup limits need a big boost in resources [puppet] - 10https://gerrit.wikimedia.org/r/489127 (https://phabricator.wikimedia.org/T215434) [02:38:23] (03CR) 10Bstorm: "And so begins the great journey of giving enough resources...but not too much! :)" [puppet] - 10https://gerrit.wikimedia.org/r/489127 (https://phabricator.wikimedia.org/T215434) (owner: 10Bstorm) [02:38:43] (03CR) 10Bstorm: [C: 03+2] toolforge: bastion cgroup limits need a big boost in resources [puppet] - 10https://gerrit.wikimedia.org/r/489127 (https://phabricator.wikimedia.org/T215434) (owner: 10Bstorm) [02:39:09] (03CR) 10Jforrester: [C: 03+1] tests: Assert that no computed lists are used in wmf-config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489126 (owner: 10Krinkle) [02:46:02] (03CR) 10Krinkle: [C: 03+2] tests: Assert that no computed lists are used in wmf-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489126 (owner: 10Krinkle) [02:47:14] (03Merged) 10jenkins-bot: tests: Assert that no computed lists are used in wmf-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489126 (owner: 10Krinkle) [02:50:44] (03CR) 10jenkins-bot: tests: Assert that no computed lists are used in wmf-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489126 (owner: 10Krinkle) [04:15:45] (03CR) 10Samwilson: "> The new Debian Stretch job grid is PHP 7.2." 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/488764 (https://phabricator.wikimedia.org/T213669) (owner: 10Samwilson) [04:22:44] PROBLEM - Device not healthy -SMART- on db2053 is CRITICAL: cluster=mysql device=cciss,3 instance=db2053:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2053&var-datasource=codfw+prometheus/ops [04:46:27] (03PS1) 10Andrew Bogott: openstack: add some ipv6 firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/489133 [04:47:20] (03PS1) 10BryanDavis: Move most code into jouncebot package [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489134 [04:47:22] (03PS1) 10BryanDavis: Toolforge kubernetes support [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489135 [04:47:24] (03CR) 10jerkins-bot: [V: 04-1] openstack: add some ipv6 firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/489133 (owner: 10Andrew Bogott) [04:48:56] RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [04:48:58] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [04:48:58] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.44 seconds [04:49:08] RECOVERY - MariaDB Slave Lag: s4 on db2091 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [04:49:38] RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [04:49:44] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 0.39 seconds [04:49:55] (03CR) 10BryanDavis: "> Is it okay to stay with the (new) job grid?" 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/488764 (https://phabricator.wikimedia.org/T213669) (owner: 10Samwilson) [04:49:57] (03PS2) 10Andrew Bogott: openstack: add some ipv6 firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/489133 [04:49:58] RECOVERY - MariaDB Slave Lag: s4 on db2084 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [04:50:16] (03CR) 10BryanDavis: [C: 04-1] Add all fonts used in production MediaWiki [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/488764 (https://phabricator.wikimedia.org/T213669) (owner: 10Samwilson) [04:52:51] (03CR) 10Andrew Bogott: [C: 03+2] openstack: add some ipv6 firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/489133 (owner: 10Andrew Bogott) [04:53:09] (03CR) 10BryanDavis: [C: 03+2] Move most code into jouncebot package [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489134 (owner: 10BryanDavis) [04:53:41] (03Merged) 10jenkins-bot: Move most code into jouncebot package [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489134 (owner: 10BryanDavis) [04:53:48] (03CR) 10BryanDavis: [C: 03+2] Toolforge kubernetes support [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489135 (owner: 10BryanDavis) [04:54:17] (03Merged) 10jenkins-bot: Toolforge kubernetes support [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489135 (owner: 10BryanDavis) [05:10:37] (03Abandoned) 10Samwilson: Add all fonts used in production MediaWiki [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/488764 (https://phabricator.wikimedia.org/T213669) (owner: 10Samwilson) [05:13:56] (03PS1) 10BryanDavis: Update runner script and default config [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489138 [05:14:24] (03CR) 10jerkins-bot: [V: 04-1] Update runner script and default config [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489138 (owner: 10BryanDavis) [05:18:55] (03PS2) 10BryanDavis: Update runner 
script and default config [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489138 [05:19:14] (03CR) 10jerkins-bot: [V: 04-1] Update runner script and default config [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489138 (owner: 10BryanDavis) [05:24:10] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:24:22] (03PS3) 10BryanDavis: Update runner script and default config [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489138 [05:27:52] (03CR) 10BryanDavis: [C: 03+2] Update runner script and default config [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489138 (owner: 10BryanDavis) [05:28:13] (03Merged) 10jenkins-bot: Update runner script and default config [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489138 (owner: 10BryanDavis) [05:28:26] jouncebot: next [05:28:26] In 77 hour(s) and 1 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190211T1030) [05:28:51] * bd808 is about to migrate jouncebot to the Toolforge kubernetes cluster [05:47:34] :) [06:07:28] !log Drop staging.mep_word_persistence from dbstore1002 T215450 T213706 [06:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:34] T215450: Sqoop staging.mep_word_persistence to HDFS and drop the table from dbstore1002 - https://phabricator.wikimedia.org/T215450 [06:07:35] T213706: Convert Aria/Tokudb tables to InnoDB on dbstore1002 - https://phabricator.wikimedia.org/T213706 [06:11:09] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. 
status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [06:11:52] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2053 is CRITICAL: cluster=mysql device=cciss,3 instance=db2053:9100 job=node site=codfw Marostegui T208323 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2053&var-datasource=codfw+prometheus/ops [06:12:57] (03PS1) 10BryanDavis: Fix py3: s/iteritems/items/ [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489143 [06:13:36] (03CR) 10BryanDavis: [C: 03+2] Fix py3: s/iteritems/items/ [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489143 (owner: 10BryanDavis) [06:13:40] (03PS1) 10Marostegui: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489144 (https://phabricator.wikimedia.org/T210713) [06:13:57] (03Merged) 10jenkins-bot: Fix py3: s/iteritems/items/ [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489143 (owner: 10BryanDavis) [06:16:42] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489144 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [06:17:45] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489144 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [06:18:53] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489144 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [06:21:14] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1098:3317 (duration: 02m 58s) [06:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:45] !log Deploy schema change on db1098:3317 [06:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:31] (03PS1) 10BryanDavis: Run `2to3 -w` on codebase [wikimedia/bots/jouncebot] - 
10https://gerrit.wikimedia.org/r/489145 [06:27:33] (03PS1) 10BryanDavis: Update .gitignore [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489146 [06:27:59] (03PS1) 10Marostegui: dbstore.my.cnf: Enable automatic slaves start [puppet] - 10https://gerrit.wikimedia.org/r/489147 (https://phabricator.wikimedia.org/T213670) [06:28:38] (03CR) 10BryanDavis: [C: 03+2] Run `2to3 -w` on codebase [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489145 (owner: 10BryanDavis) [06:28:39] 10Operations, 10ops-eqiad: mw1299 is down (jobrunner-canary, now up but depooled) - https://phabricator.wikimedia.org/T215569 (10Marostegui) ` /admin1/system1/logs1/log1-> show record27 properties CreationTimestamp = 20190208014959.000000-360 ElementName = System Event Log Entry RecordData = CPU 1 mach... [06:28:43] (03CR) 10BryanDavis: [C: 03+2] Update .gitignore [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489146 (owner: 10BryanDavis) [06:28:59] (03Merged) 10jenkins-bot: Run `2to3 -w` on codebase [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489145 (owner: 10BryanDavis) [06:29:02] !log powercycle mw1299 - T215569 [06:29:03] (03Merged) 10jenkins-bot: Update .gitignore [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/489146 (owner: 10BryanDavis) [06:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:05] T215569: mw1299 is down (jobrunner-canary, now up but depooled) - https://phabricator.wikimedia.org/T215569 [06:30:08] jouncebot: next [06:30:08] In 75 hour(s) and 59 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190211T1030) [06:30:13] jouncebot: now [06:30:13] No deployments scheduled for the next 75 hour(s) and 59 minute(s) [06:30:19] jouncebot: refresh [06:30:21] I refreshed my knowledge about deployments. [06:30:43] 👍 [06:31:37] marostegui: it's fscked then? 
:P [06:31:46] yeah, looks like the CPU [06:31:53] I have restarted it and going to clean up logs [06:31:58] to make sure we start "fresh" [06:32:02] it had like 130 logs XD [06:32:11] some of them are quite old, so could be confusing [06:32:31] heh [06:32:37] RECOVERY - Host mw1299 is UP: PING WARNING - Packet loss = 44%, RTA = 0.26 ms [06:32:55] PROBLEM - puppet last run on cp1090 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/dhparam.pem] [06:36:06] marostegui: Amazingly, it's still in warranty too, for like 2 months [06:39:05] oh nice, let's comment on the task so we can rush before it expires [06:40:52] 2019-04-14 if racktables is correct [06:41:34] yeah, looks so on netbox too [06:43:29] RECOVERY - puppet last run on cp1090 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [06:49:20] Operations, ops-eqiad: mw1299 is down (jobrunner-canary, now up but depooled) - https://phabricator.wikimedia.org/T215569 (Marostegui) a:RobH This host is under warranty until April 14, 2019 so we might want to try to debug this before it expires in case we need some replacement CPU or mainboard. 
[06:51:14] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/489148 [06:52:19] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/489148 (owner: Marostegui) [06:53:21] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/489148 (owner: Marostegui) [06:53:33] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/489148 (owner: Marostegui) [06:54:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1098:3317 (duration: 00m 49s) [06:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:35] (PS1) Marostegui: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/489149 (https://phabricator.wikimedia.org/T210713) [06:54:42] !log Take a mysqldump from staging on dbstore1003 from dbstore1002 - T210478 [06:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:44] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 [06:56:01] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/489149 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui) [06:57:02] (Merged) jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/489149 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui) [06:58:02] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1094 (duration: 00m 46s) [06:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:07] !log Deploy schema change on db1094 T210713 [06:58:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:10] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [07:00:17] PROBLEM - HHVM rendering on mw2289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:01:25] RECOVERY - HHVM rendering on mw2289 is OK: HTTP OK: HTTP/1.1 200 OK - 80716 bytes in 0.308 second response time [07:02:41] PROBLEM - HHVM rendering on mw2177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:02:41] PROBLEM - HHVM rendering on mw2251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:03:49] RECOVERY - HHVM rendering on mw2251 is OK: HTTP OK: HTTP/1.1 200 OK - 80716 bytes in 0.289 second response time [07:03:49] RECOVERY - HHVM rendering on mw2177 is OK: HTTP OK: HTTP/1.1 200 OK - 80716 bytes in 0.296 second response time [07:04:40] (CR) jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/489149 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui) [07:09:11] (PS1) Vgutierrez: Rename certcentral to acme-chief [software/certcentral] - https://gerrit.wikimedia.org/r/489150 (https://phabricator.wikimedia.org/T207389) [07:09:43] Operations, MediaWiki-Cache, serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (Joe) >>! In T212129#4932035, @EvanProdromou wrote: >... 
[07:10:54] (CR) jerkins-bot: [V: -1] Rename certcentral to acme-chief [software/certcentral] - https://gerrit.wikimedia.org/r/489150 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [07:12:43] !log Upgrade mysql and kernel on db1094 [07:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:55] PROBLEM - Host mw1299 is DOWN: PING CRITICAL - Packet loss = 100% [07:13:06] heh, it didn't last long [07:17:27] Operations, MediaWiki-Cache, serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (Joe) >>! In T212129#4935629, @EvanProdromou wrote: >... [07:19:53] Operations, ops-eqiad: mw1299 is down (jobrunner-canary, now up but depooled) - https://phabricator.wikimedia.org/T215569 (Marostegui) And if crashed again with the same error: ` /admin1/system1/logs1/log1-> show record13 properties CreationTimestamp = 20190208071154.000000-360 ElementName = System... 
[07:20:17] Operations, Discovery, Discovery-Search, Elasticsearch: Merge http and https elasticsearch icinga checks into one - https://phabricator.wikimedia.org/T215587 (Mathew.onipe) [07:20:29] Operations, Discovery, Discovery-Search, Elasticsearch: Merge http and https elasticsearch icinga checks into one - https://phabricator.wikimedia.org/T215587 (Mathew.onipe) p:Triage→Normal a:Mathew.onipe [07:21:03] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 883.27 seconds [07:21:51] (PS1) Marostegui: db-eqiad.php: Slowly repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/489151 [07:21:55] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 606.95 seconds [07:22:47] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 877.48 seconds [07:22:53] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/489151 (owner: Marostegui) [07:22:57] (PS2) Vgutierrez: Rename certcentral to acme-chief [software/certcentral] - https://gerrit.wikimedia.org/r/489150 (https://phabricator.wikimedia.org/T207389) [07:23:53] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/489151 (owner: Marostegui) [07:24:31] RECOVERY - MariaDB Slave Lag: s2 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 200.82 seconds [07:24:32] (CR) jerkins-bot: [V: -1] Rename certcentral to acme-chief [software/certcentral] - https://gerrit.wikimedia.org/r/489150 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [07:26:53] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/489151 (owner: Marostegui) [07:27:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1094 
(duration: 02m 56s) [07:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:53] !log marostegui@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1299.eqiad.wmnet [07:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:51] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 61.16 seconds [07:29:51] (PS1) Marostegui: db-eqiad.php: Give more traffic to db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/489153 [07:32:20] (CR) Elukey: [C: +1] dbstore.my.cnf: Enable automatic slaves start [puppet] - https://gerrit.wikimedia.org/r/489147 (https://phabricator.wikimedia.org/T213670) (owner: Marostegui) [07:32:52] (PS1) Mathew.onipe: icinga: enable check for psi and omega clusters [puppet] - https://gerrit.wikimedia.org/r/489154 (https://phabricator.wikimedia.org/T212850) [07:33:07] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 197.70 seconds [07:36:03] (PS2) Mathew.onipe: icinga: enable check for psi and omega clusters [puppet] - https://gerrit.wikimedia.org/r/489154 (https://phabricator.wikimedia.org/T212850) [07:36:53] Operations, MediaWiki-Cache, serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. 
- https://phabricator.wikimedia.org/T212129 (jijiki) [07:37:40] (CR) Marostegui: [C: +2] db-eqiad.php: Give more traffic to db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/489153 (owner: Marostegui) [07:38:43] (Merged) jenkins-bot: db-eqiad.php: Give more traffic to db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/489153 (owner: Marostegui) [07:38:56] (CR) jenkins-bot: db-eqiad.php: Give more traffic to db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/489153 (owner: Marostegui) [07:39:08] (CR) Marostegui: [C: +2] dbstore.my.cnf: Enable automatic slaves start [puppet] - https://gerrit.wikimedia.org/r/489147 (https://phabricator.wikimedia.org/T213670) (owner: Marostegui) [07:41:55] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1094 (duration: 02m 55s) [07:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:32] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw1299.eqiad.wmnet [07:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:35] (PS3) Vgutierrez: Rename certcentral to acme-chief [software/certcentral] - https://gerrit.wikimedia.org/r/489150 (https://phabricator.wikimedia.org/T207389) [07:51:24] (CR) Vgutierrez: "This change is ready for review." 
[software/certcentral] - https://gerrit.wikimedia.org/r/489150 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [07:52:07] (PS1) Marostegui: db-eqiad.php: Fully repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/489155 [07:53:44] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/489155 (owner: Marostegui) [07:54:46] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/489155 (owner: Marostegui) [07:55:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Full repool db1094 (duration: 00m 47s) [07:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:51] (PS1) Marostegui: db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/489156 (https://phabricator.wikimedia.org/T210713) [07:59:13] (CR) jenkins-bot: db-eqiad.php: Fully repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/489155 (owner: Marostegui) [08:00:50] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/489156 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui) [08:01:53] (Merged) jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/489156 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui) [08:03:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1086 (duration: 00m 46s) [08:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:22] !log Upgrade MySQL on db1086 and deploy schema change [08:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:40] Operations, Wikimedia-Mailing-lists, User-jijiki: Please create docker-sig@ mailing list - https://phabricator.wikimedia.org/T215563 (jijiki) 
p:Triage→Normal [08:10:19] (CR) jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/489156 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui) [08:15:59] !log Upgrade MySQL on db1086 [08:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:23] (PS1) Jcrespo: mariadb: Depool db1083 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/489157 [08:17:50] (CR) Marostegui: [C: +1] mariadb: Depool db1083 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/489157 (owner: Jcrespo) [08:19:48] (CR) Jcrespo: [C: +2] mariadb: Depool db1083 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/489157 (owner: Jcrespo) [08:20:57] (Merged) jenkins-bot: mariadb: Depool db1083 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/489157 (owner: Jcrespo) [08:21:43] (CR) jenkins-bot: mariadb: Depool db1083 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/489157 (owner: Jcrespo) [08:23:33] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1083 (duration: 00m 47s) [08:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:49] !log stop and upgrade db1083 [08:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:49] (PS1) Jcrespo: Revert "mariadb: Depool db1083 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/489158 [08:26:29] (Abandoned) Elukey: profile::kafka::broker: fix cloud ferm ranges [puppet] - https://gerrit.wikimedia.org/r/471951 (owner: Elukey) [08:26:40] (Abandoned) Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) (owner: Elukey) [08:27:42] (PS1) Marostegui: db-eqiad.php: Slowly repool db1086 [mediawiki-config] - 
https://gerrit.wikimedia.org/r/489159 [08:28:38] (PS1) Jcrespo: mariadb: Repool db1083 with low load after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/489160 [08:28:50] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/489159 (owner: Marostegui) [08:29:54] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/489159 (owner: Marostegui) [08:30:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1086 (duration: 00m 47s) [08:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:50] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/489159 (owner: Marostegui) [08:39:43] (PS2) Jcrespo: mariadb: Repool db1083 with low load after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/489160 [08:42:33] PROBLEM - HHVM jobrunner on mw1306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [08:42:43] PROBLEM - Nginx local proxy to videoscaler on mw1306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.009 second response time [08:43:07] (CR) Jcrespo: [C: +2] mariadb: Repool db1083 with low load after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/489160 (owner: Jcrespo) [08:43:49] RECOVERY - HHVM jobrunner on mw1306 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.006 second response time [08:43:59] RECOVERY - Nginx local proxy to videoscaler on mw1306 is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.032 second response time [08:44:08] (Merged) jenkins-bot: mariadb: Repool db1083 with low load after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/489160 (owner: Jcrespo) [08:44:21] (CR) jenkins-bot: mariadb: Repool db1083 with low load after 
maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/489160 (owner: Jcrespo) [08:48:48] (PS1) Vgutierrez: Add acmechief[12]001 DNS entries [dns] - https://gerrit.wikimedia.org/r/489161 (https://phabricator.wikimedia.org/T207389) [08:49:16] (PS1) Jcrespo: mariadb: Depool db1099 from s1 and s8 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/489162 [08:50:37] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1083 with low load (duration: 00m 46s) [08:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:16] (CR) Vgutierrez: [C: +2] Add acmechief[12]001 DNS entries [dns] - https://gerrit.wikimedia.org/r/489161 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [08:52:07] (PS2) Jcrespo: mariadb: Depool db1099 from s1 and s8 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/489162 [08:53:10] !log reimage graphite2002 to buster [08:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:36] \o/ [08:59:54] (PS1) Marostegui: db-eqiad.php: Give more traffic to db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/489163 [09:00:13] moritzm: any chance that I can do the same with stat1005?? 
:D [09:02:34] (CR) Marostegui: [C: +2] db-eqiad.php: Give more traffic to db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/489163 (owner: Marostegui) [09:03:14] (PS1) Vgutierrez: install_server: Add DHCP entries for acmechief[12]001 [puppet] - https://gerrit.wikimedia.org/r/489164 (https://phabricator.wikimedia.org/T207389) [09:03:38] (Merged) jenkins-bot: db-eqiad.php: Give more traffic to db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/489163 (owner: Marostegui) [09:04:37] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1086 (duration: 00m 47s) [09:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:27] (CR) jenkins-bot: db-eqiad.php: Give more traffic to db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/489163 (owner: Marostegui) [09:05:51] elukey: still running into some installer issues, but when that's figured out, for sure! [09:06:44] !log installing libarchive security updates [09:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:52] (CR) Faidon Liambotis: "This sounds like a good idea in general, but is there a plan for how users are supposed to validate (and by extension, invalidate) the hos" [dns] - https://gerrit.wikimedia.org/r/489103 (owner: Dzahn) [09:16:13] (PS1) Marostegui: db-eqiad.php: Fully repool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/489166 [09:16:23] !log installing rssh security updates [09:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:10] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/489166 (owner: Marostegui) [09:20:14] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/489166 (owner: Marostegui) [09:22:19] !log marostegui@deploy1001 Synchronized 
wmf-config/db-eqiad.php: Fully repool db1086 (duration: 00m 46s) [09:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:31] Operations, Discovery, Discovery-Search, Elasticsearch, and 2 others: Merge http and https elasticsearch icinga checks into one - https://phabricator.wikimedia.org/T215587 (Peachey88) [09:22:52] (PS3) Jcrespo: mariadb: Depool db1099 from s1 and s8 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/489162 [09:26:11] (CR) Jcrespo: [C: +2] mariadb: Depool db1099 from s1 and s8 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/489162 (owner: Jcrespo) [09:27:19] (Merged) jenkins-bot: mariadb: Depool db1099 from s1 and s8 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/489162 (owner: Jcrespo) [09:28:51] (CR) jenkins-bot: db-eqiad.php: Fully repool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/489166 (owner: Marostegui) [09:28:53] (CR) jenkins-bot: mariadb: Depool db1099 from s1 and s8 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/489162 (owner: Jcrespo) [09:28:55] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1099 (duration: 00m 46s) [09:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:21] Operations: confd: Superfluous golang dependency - https://phabricator.wikimedia.org/T215593 (MoritzMuehlenhoff) [09:34:28] (PS1) Marostegui: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/489167 (https://phabricator.wikimedia.org/T210713) [09:34:50] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/14583/" [puppet] - https://gerrit.wikimedia.org/r/488436 (owner: Muehlenhoff) [09:35:51] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1090:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/489167 
(https://phabricator.wikimedia.org/T210713) (owner: Marostegui) [09:36:57] (Merged) jenkins-bot: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/489167 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui) [09:37:45] Operations, Maps (Kartotherian): Create discovery entry for Kartotherian - https://phabricator.wikimedia.org/T214672 (Gehel) Open→Invalid Discovery entry is only used for internal communications, but not by varnish (which confused me quite a bit). So we do have the proper entries in place, let's... [09:37:59] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1090:3317 (duration: 00m 46s) [09:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:31] (CR) jenkins-bot: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/489167 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui) [09:42:22] (PS1) Elukey: Add analytics dbstore SRV records [dns] - https://gerrit.wikimedia.org/r/489170 [09:43:57] (PS1) Muehlenhoff: Extend d-i config for buster [puppet] - https://gerrit.wikimedia.org/r/489171 (https://phabricator.wikimedia.org/T213527) [09:51:54] (PS10) Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - https://gerrit.wikimedia.org/r/475772 [09:52:40] (CR) jerkins-bot: [V: -1] Update AbuseFilter config to keep the status quo [mediawiki-config] - https://gerrit.wikimedia.org/r/475772 (owner: Daimona Eaytoy) [09:55:30] (PS2) Daimona Eaytoy: Remove $wgAbuseFilterRuntimeProfile [mediawiki-config] - https://gerrit.wikimedia.org/r/486470 (https://phabricator.wikimedia.org/T191039) [09:57:00] Operations, Wikidata, Wikidata-Query-Service, MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), User-Addshore: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (jcrespo) 
[09:57:23] Operations, Wikidata, Wikidata-Query-Service, MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), User-Addshore: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (jcrespo) [10:09:18] (PS11) Daimona Eaytoy: Move all AbuseFilter config to abusefilter.php [mediawiki-config] - https://gerrit.wikimedia.org/r/477063 (https://phabricator.wikimedia.org/T145931) [10:10:16] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/489174 [10:10:55] (CR) Filippo Giunchedi: [C: +1] Icinga: add ping check for ulsfo PDUs [puppet] - https://gerrit.wikimedia.org/r/489113 (https://phabricator.wikimedia.org/T209101) (owner: Ayounsi) [10:11:49] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/489174 (owner: Marostegui) [10:12:59] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/489174 (owner: Marostegui) [10:13:50] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/489174 (owner: Marostegui) [10:14:15] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1090:3317 (duration: 00m 47s) [10:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:46] !log stop and upgrade db1099 [10:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:17] (CR) Gehel: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/487982 (https://phabricator.wikimedia.org/T205886) (owner: Volans) [10:23:38] !log swift codfw-prod: more weight to ms-be2047 - T209395 T209921 [10:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:42] T209395: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - 
https://phabricator.wikimedia.org/T209395 [10:23:43] T209921: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 [10:24:05] (PS1) Effie Mouzeli: Apply -R 200 to memcached on mc1026 [puppet] - https://gerrit.wikimedia.org/r/489175 (https://phabricator.wikimedia.org/T208844) [10:27:19] !log Restarting memcached on mc1026 to apply '-R 200' - T208844 [10:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:22] T208844: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 [10:27:36] (CR) Effie Mouzeli: [C: +2] Apply -R 200 to memcached on mc1026 [puppet] - https://gerrit.wikimedia.org/r/489175 (https://phabricator.wikimedia.org/T208844) (owner: Effie Mouzeli) [10:28:50] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/488436 (owner: Muehlenhoff) [10:29:17] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 724.90 seconds [10:29:51] (PS11) Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - https://gerrit.wikimedia.org/r/475772 [10:30:21] (CR) jerkins-bot: [V: -1] Update AbuseFilter config to keep the status quo [mediawiki-config] - https://gerrit.wikimedia.org/r/475772 (owner: Daimona Eaytoy) [10:31:41] Operations, MediaWiki-Cache, serviceops, Patch-For-Review, and 3 others: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (jijiki) [10:43:37] (PS1) Jcrespo: Revert "mariadb: Depool db1099 from s1 and s8 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/489178 [10:46:04] (CR) Jcrespo: [C: +2] Revert "mariadb: Depool db1099 from s1 and s8 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/489178 (owner: Jcrespo) [10:47:13] (Merged) jenkins-bot: Revert 
"mariadb: Depool db1099 from s1 and s8 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/489178 (owner: Jcrespo) [10:47:52] (CR) jenkins-bot: Revert "mariadb: Depool db1099 from s1 and s8 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/489178 (owner: Jcrespo) [10:50:13] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1099 (duration: 00m 47s) [10:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:27] (PS12) Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - https://gerrit.wikimedia.org/r/475772 [10:53:13] (PS2) Jcrespo: Revert "mariadb: Depool db1083 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/489158 [10:53:49] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 239.49 seconds [10:57:21] (CR) Marostegui: [C: +1] "The mapping is correct." [dns] - https://gerrit.wikimedia.org/r/489170 (owner: Elukey) [10:58:39] Operations, serviceops, User-jijiki: Fix spamassassin's "warn: netset: cannot include " warning - https://phabricator.wikimedia.org/T215496 (jijiki) Open→Resolved a:jijiki Resolved by @akosiaris in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/488894/ [10:59:10] (CR) Jcrespo: [C: +2] Revert "mariadb: Depool db1083 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/489158 (owner: Jcrespo) [11:00:15] (Merged) jenkins-bot: Revert "mariadb: Depool db1083 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/489158 (owner: Jcrespo) [11:01:02] (PS2) Muehlenhoff: Extend d-i config for buster [puppet] - https://gerrit.wikimedia.org/r/489171 (https://phabricator.wikimedia.org/T213527) [11:04:18] (CR) Jbond: [C: +1] "this seems perfectly legit use of SRV records to me" [dns] - https://gerrit.wikimedia.org/r/489170 (owner: Elukey) [11:04:46] (CR) Effie 
Mouzeli: [C: +2] redis: Stop supporting trusty/upstart [puppet] - https://gerrit.wikimedia.org/r/488436 (owner: Muehlenhoff) [11:05:07] (PS2) Effie Mouzeli: redis: Stop supporting trusty/upstart [puppet] - https://gerrit.wikimedia.org/r/488436 (owner: Muehlenhoff) [11:05:24] (CR) Muehlenhoff: [C: +2] Extend d-i config for buster [puppet] - https://gerrit.wikimedia.org/r/489171 (https://phabricator.wikimedia.org/T213527) (owner: Muehlenhoff) [11:06:22] (PS3) Effie Mouzeli: redis: Stop supporting trusty/upstart [puppet] - https://gerrit.wikimedia.org/r/488436 (owner: Muehlenhoff) [11:08:19] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1083 fully (duration: 00m 47s) [11:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:15] (CR) jenkins-bot: Revert "mariadb: Depool db1083 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/489158 (owner: Jcrespo) [11:18:26] Operations, monitoring, Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (MoritzMuehlenhoff) As discussed on IRC: Let's upgrade to 2.7.1 next week as that fixes a security issue (CVE-2019-3826) in the internal UI (not expos... [11:18:45] Operations, Patch-For-Review: ferm: Log dropped packets - https://phabricator.wikimedia.org/T116011 (jbond) discussion from meeting https://etherpad.wikimedia.org/p/SRE-Foundations-ulogd_discussion The conclusion was to have iptable drop standard entries sent to syslog, as syslog is already in the logg... [11:19:03] Operations, monitoring: Expose linux kernel firewall and connections statistics - https://phabricator.wikimedia.org/T215277 (jbond) discussion from meeting https://etherpad.wikimedia.org/p/SRE-Foundations-ulogd_discussion The conclusion was to have iptable drop standard entries sent to syslog, as syslo... 
[11:29:46] (CR) Alexandros Kosiaris: "Would a comment instead of a removal work then?" [deployment-charts] - https://gerrit.wikimedia.org/r/488800 (owner: Alexandros Kosiaris) [11:36:15] !log reimage graphite2002 to buster [11:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:00] Operations, ops-codfw, monitoring, Patch-For-Review: rack/setup/install graphite2003 - https://phabricator.wikimedia.org/T196483 (fgiunchedi) [11:40:02] Operations, ops-codfw, monitoring: graphite2001 crashed - https://phabricator.wikimedia.org/T198041 (fgiunchedi) Open→Declined Host is going to be decom -- declining [11:51:30] (PS4) D3r1ck01: Stop NavPopups gadget conflict with PagePreviews on Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/487007 (https://phabricator.wikimedia.org/T214878) [11:52:19] (CR) D3r1ck01: "PS4 is **only** a manual rebase!" [mediawiki-config] - https://gerrit.wikimedia.org/r/487007 (https://phabricator.wikimedia.org/T214878) (owner: D3r1ck01) [11:55:43] (CR) Filippo Giunchedi: "LGTM overall, some food-for-thought questions (i.e. 
ok to followup/address later and on phabricator)" [dns] - 10https://gerrit.wikimedia.org/r/489170 (owner: 10Elukey) [12:06:04] (03PS26) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [12:07:04] (03CR) 10jerkins-bot: [V: 04-1] Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [12:07:56] (03PS27) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [12:08:49] (03CR) 10jerkins-bot: [V: 04-1] Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [12:12:14] (03CR) 10Elukey: "> LGTM overall, some food-for-thought questions (i.e. ok to" [dns] - 10https://gerrit.wikimedia.org/r/489170 (owner: 10Elukey) [12:17:51] PROBLEM - Host db1114 is DOWN: PING CRITICAL - Packet loss = 100% [12:18:51] is anybody working on --^ ? [12:18:55] don't see anything in the SAL [12:18:59] Cc: marostegui, jynus [12:19:58] from https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php it seems pooled for s1 right? 
(slave of course) [12:23:20] (03PS1) 10Elukey: Depool db1114 - host down [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489189 [12:24:27] (03PS2) 10Elukey: Depool db1114 - host down [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489189 [12:25:21] (03CR) 10Muehlenhoff: Introduce systemd::slice::all_users (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [12:27:54] db1114 seems to have crashed due to memory errors, there's several "Critical" errors in SEL for DIMM_B7 and DIMM_B3 [12:28:18] yeah I think it is not the first time it crashed, it happened a week ago (from SAL) [12:28:23] "Multi-bit memory errors detected on a memory device" [12:28:38] right, just found https://phabricator.wikimedia.org/T214720 [12:29:46] (03PS3) 10Elukey: Depool db1114 - host down [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489189 (https://phabricator.wikimedia.org/T214720) [12:29:50] I think my patch should be ok, there is already another shard for the api read traffic [12:31:26] probably yes [12:32:55] calling manuel [12:34:07] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (10MoritzMuehlenhoff) The server went down at 12:16, with a number of memory errors logged in SEL: ` ------------------------------------------------------------------------------- Record:... [12:34:17] trying with Jaime [12:34:19] (03PS13) 10BBlack: WIP: Add a Google Translate-specific redirect-to-mobile [puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) (owner: 10Dr0ptp4kt) [12:34:50] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Depool db1114 - host down [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489189 (https://phabricator.wikimedia.org/T214720) (owner: 10Elukey) [12:34:57] <_joe_> elukey: jfdi :) [12:35:01] ack. 
the serial console is also dead, I see a few characters of garbage output, but nothing else [12:35:17] _joe_ I don't recall exactly how to deploy :P [12:35:26] <_joe_> is enwiki working? meaning you can edit? [12:35:36] that is a slave, only read traffic [12:35:46] <_joe_> elukey: that sadly didn't matter in the past [12:35:58] ah yes but it should be fixed no? [12:36:03] <_joe_> should be [12:36:06] <_joe_> might be [12:36:08] <_joe_> :) [12:36:31] <_joe_> anyways, you +2 the patch, wait for gate-and-submit, go on deploy1001 [12:36:41] (03CR) 10Elukey: [C: 03+2] Depool db1114 - host down [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489189 (https://phabricator.wikimedia.org/T214720) (owner: 10Elukey) [12:36:44] <_joe_> and well, jijiki knows what to do if she's around [12:36:49] <_joe_> else, I can assist [12:37:30] (03PS28) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [12:37:36] what did I miss? [12:37:52] <_joe_> you need to pull the code in /srv/mediawiki-staging and do a scap sync-file, see https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Small_changes:_sync_individual_files_or_directories [12:38:03] <_joe_> jijiki: elukey needs to deploy 1 file via scap [12:38:31] good, we can experiment together [12:38:59] (03CR) 10jenkins-bot: Depool db1114 - host down [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489189 (https://phabricator.wikimedia.org/T214720) (owner: 10Elukey) [12:39:07] elukey: so _joe_ by saying "knows what to do" means "looking for trouble: [12:39:09] " [12:39:46] :D [12:40:01] ok so there's nothing pending in mediawiki-staging [12:40:56] so I have to cd into wmf-config, pull, scap sync-file blabla right? 
[12:41:11] (03PS14) 10BBlack: WIP: Add a Google Translate-specific redirect-to-mobile [puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) (owner: 10Dr0ptp4kt) [12:41:31] let's tmux on deploy1001 [12:42:08] <_joe_> elukey: not in wmf-config, in the main dir [12:42:09] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.8295 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [12:42:10] we have to remote update, check diffs [12:42:22] elukey: in /srv/mediawiki-staging [12:42:25] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:43:05] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.2456 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [12:44:35] (03PS29) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [12:44:49] !log elukey@deploy1001 Synchronized wmf-config/db-eqiad.php: depooling db1114, host down (duration: 00m 47s) [12:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:32] all right seems done [12:46:06] ms-be2020 was just the session scope bug under high load, fixed [12:46:19] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational [12:46:59] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.2406 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [12:47:43] om anyone knows about [12:47:45] https://grafana.wikimedia.org/d/000000561/logstash?orgId=1 [12:47:47] this? 
[12:48:26] (03PS1) 10Joal: Update aqs druid datasource to 2019_01 snapshot [puppet] - 10https://gerrit.wikimedia.org/r/489194 [12:50:03] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.1429 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [12:51:38] !log disabling notifications on db1114 [12:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:41] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.01983 https://grafana.wikimedia.org/dashboard/db/logstash [12:53:32] (03CR) 10Elukey: [C: 03+2] Update aqs druid datasource to 2019_01 snapshot [puppet] - 10https://gerrit.wikimedia.org/r/489194 (owner: 10Joal) [12:55:45] (03PS1) 10Arturo Borrero Gonzalez: toolforge: grid: exec_environ: use libmariadbclient* packages [puppet] - 10https://gerrit.wikimedia.org/r/489195 (https://phabricator.wikimedia.org/T215578) [12:56:16] (03PS2) 10Arturo Borrero Gonzalez: toolforge: grid: exec_environ: use libmariadbclient-dev* packages [puppet] - 10https://gerrit.wikimedia.org/r/489195 (https://phabricator.wikimedia.org/T215578) [12:57:11] (03CR) 10Elukey: Introduce systemd::slice::all_users (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [12:59:03] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [13:00:01] (03CR) 10Muehlenhoff: Introduce systemd::slice::all_users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [13:01:16] (03CR) 10Elukey: Introduce systemd::slice::all_users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [13:04:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - 
https://phabricator.wikimedia.org/T209029 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudelastic1004.eqiad.wmnet ` The log can be fou... [13:04:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudelastic1004.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudelastic1004.eqiad.wmnet'] ` [13:05:09] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.3605 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [13:05:09] PROBLEM - Check systemd state on ms-be2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:05:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudelastic1004.eqiad.wmnet ` The log can be fou... [13:05:31] jijiki: are you looking at the logstash overload? 
[13:05:44] gehel: yeah with jynu.s [13:05:54] !log T209029 reimaging cloudelastic1004 [13:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:58] T209029: cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 [13:06:16] jijiki: ok, I probably don't know much more than you about it, but ping me if you need another set of eyes [13:06:23] tx tx :)\ [13:09:01] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.03716 https://grafana.wikimedia.org/dashboard/db/logstash [13:15:20] (03PS12) 10Jbond: Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) [13:15:32] (03PS1) 10Jcrespo: mariadb: Pool rc slaves with higher weight to rebalance load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489200 (https://phabricator.wikimedia.org/T214720) [13:16:18] (03CR) 10jerkins-bot: [V: 04-1] Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: 10Jbond) [13:17:38] (03CR) 10GTirloni: [C: 03+2] toolforge: grid: exec_environ: use libmariadbclient-dev* packages [puppet] - 10https://gerrit.wikimedia.org/r/489195 (https://phabricator.wikimedia.org/T215578) (owner: 10Arturo Borrero Gonzalez) [13:20:21] (03PS7) 10Elukey: Introduce systemd::slice::all_users [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) [13:21:09] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] Stop NavPopups gadget conflict with PagePreviews on Wikivoyage (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487007 (https://phabricator.wikimedia.org/T214878) (owner: 10D3r1ck01) [13:21:17] PROBLEM - Nginx local proxy to apache on mw2221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:09] RECOVERY - Nginx local proxy to 
apache on mw2221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.211 second response time [13:31:49] (03CR) 10Arturo Borrero Gonzalez: "When dropping the code in Toolforge, there is leftover stale file from file(profile/toolforge/bastion-root-resource-control.conf). Please " [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [13:33:59] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudelastic1004.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudelastic1004.eqiad.wmnet'] ` [13:34:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudelastic1004.eqiad.wmnet ` The log can be fou... [13:34:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudelastic1004.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudelastic1004.eqiad.wmnet'] ` [13:35:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudelastic1004.eqiad.wmnet ` The log can be fou... 
[13:37:40] !log roll restart of aqs on aqs1* to pick up new druid backend changes [13:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:49] RECOVERY - Check systemd state on ms-be2021 is OK: OK - running: The system is fully operational [13:39:53] !log starting osm-initial-import for maps2004 which is the newly migrated to stretch master - T198622 [13:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:56] T198622: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 [13:44:18] (03PS1) 10Urbanecm: New throttle rule + removal of expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489207 (https://phabricator.wikimedia.org/T215446) [13:44:58] Hi, anybody to deploy https://gerrit.wikimedia.org/r/489207 for T215446? It's throttle rule for tomorrow [13:44:59] T215446: Request for temporary lift of account creation cap for Wikipedia edit-a-thon event (Feb 9) - https://phabricator.wikimedia.org/T215446 [13:45:40] !log racadm serveraction powercycle db1114 [13:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:51] (03Abandoned) 10Zppix: Remove past throttles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487881 (owner: 10Zppix) [13:46:24] (03PS3) 10Zppix: Lift Account creation cap for Women Activists edit-a-thon at Simmons University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487876 (https://phabricator.wikimedia.org/T215069) [13:47:54] (03CR) 10D3r1ck01: "Thanks for catching that." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487007 (https://phabricator.wikimedia.org/T214878) (owner: 10D3r1ck01) [13:48:00] 10Operations, 10MediaWiki-Database, 10monitoring: MediaWiki errors overloading logtash - https://phabricator.wikimedia.org/T215611 (10Marostegui) [13:49:40] (03PS5) 10D3r1ck01: Stop NavPopups gadget conflict with PagePreviews on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487007 (https://phabricator.wikimedia.org/T214878) [13:49:50] 10Operations, 10MediaWiki-Database, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10monitoring: MediaWiki errors overloading logtash - https://phabricator.wikimedia.org/T215611 (10jcrespo) [13:50:12] (03CR) 10Thiemo Kreuz (WMDE): "I'm quite a bit worried to see we are starting a "black list of bad words" here. When will we stop expanding this list? How did it started" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487917 (https://phabricator.wikimedia.org/T201491) (owner: 10Dzahn) [13:52:03] 10Operations, 10MediaWiki-Database, 10Wikimedia-Logstash, 10monitoring, 10Wikimedia-production-error: MediaWiki errors overloading logtash - https://phabricator.wikimedia.org/T215611 (10jcrespo) To clarify "lag behind"- it created at least 20 minutes of lag, which would have blocked any mediawiki [13:56:47] !log updated firmware-enriched buster netboot image to 20190208 daily build, the alpha5 image no longer works as Linux 4.19.16-1 bumped the ABI and migrated to testing yesterday [13:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:49] (03PS1) 10Marostegui: db1114: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/489210 (https://phabricator.wikimedia.org/T214720) [13:57:31] (03CR) 10Jcrespo: [C: 03+1] db1114: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/489210 (https://phabricator.wikimedia.org/T214720) (owner: 10Marostegui) [13:57:49] (03CR) 10Marostegui: [C: 03+2] db1114: Disable 
notifications [puppet] - 10https://gerrit.wikimedia.org/r/489210 (https://phabricator.wikimedia.org/T214720) (owner: 10Marostegui) [14:04:45] (03PS1) 10Bmansurov: Add page-links-change event to EventStreams [puppet] - 10https://gerrit.wikimedia.org/r/489211 (https://phabricator.wikimedia.org/T214706) [14:05:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10aborrero) I'm having troubles reimaging the server: ` aborrero@cumin1001:~$ sudo -i wmf-auto-reimage-host -p T209029 --no-verify --no-downtime --no-reboot... [14:06:43] (03PS1) 10Alexandros Kosiaris: scaffolding: Fix deployment indentation [deployment-charts] - 10https://gerrit.wikimedia.org/r/489212 [14:07:48] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] add statsd_exporter config to mathoid (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/482718 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [14:08:01] (03PS1) 10Jcrespo: mariadb: Switch db1114 and db1118 roles [puppet] - 10https://gerrit.wikimedia.org/r/489213 (https://phabricator.wikimedia.org/T214720) [14:08:24] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] scaffolding: Fix deployment indentation [deployment-charts] - 10https://gerrit.wikimedia.org/r/489212 (owner: 10Alexandros Kosiaris) [14:10:17] (03PS1) 10Jcrespo: install_server: Allow full reimage of db1114 [puppet] - 10https://gerrit.wikimedia.org/r/489214 (https://phabricator.wikimedia.org/T214720) [14:11:18] (03CR) 10Urbanecm: [C: 04-1] Lift Account creation cap for Women Activists edit-a-thon at Simmons University (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487876 (https://phabricator.wikimedia.org/T215069) (owner: 10Zppix) [14:11:48] (03CR) 10Marostegui: [C: 03+1] install_server: Allow full reimage of db1114 [puppet] - 10https://gerrit.wikimedia.org/r/489214 (https://phabricator.wikimedia.org/T214720) (owner: 10Jcrespo) [14:12:28] 
(03PS2) 10Elukey: Add analytics dbstore SRV records [dns] - 10https://gerrit.wikimedia.org/r/489170 (https://phabricator.wikimedia.org/T212386) [14:12:39] (03PS6) 10DCausse: [WIP] Upgrade to 6.5.4 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/446869 (https://phabricator.wikimedia.org/T199791) [14:12:41] (03PS3) 10DCausse: [WIP] Add nori korean analyzer [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/486266 (https://phabricator.wikimedia.org/T206874) [14:13:14] (03CR) 10Marostegui: [C: 03+1] mariadb: Switch db1114 and db1118 roles [puppet] - 10https://gerrit.wikimedia.org/r/489213 (https://phabricator.wikimedia.org/T214720) (owner: 10Jcrespo) [14:22:26] 10Operations, 10Core Platform Team, 10MediaWiki-Database, 10Wikimedia-Logstash, and 2 others: MediaWiki errors overloading logtash - https://phabricator.wikimedia.org/T215611 (10CDanis) I don't feel nearly well-versed in PHP/PSR-3/Monolog nor the MW codebase to suggest implementations, but it seems to me t... [14:23:28] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Stop NavPopups gadget conflict with PagePreviews on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487007 (https://phabricator.wikimedia.org/T214878) (owner: 10D3r1ck01) [14:27:15] (03CR) 10Jcrespo: [C: 03+2] install_server: Allow full reimage of db1114 [puppet] - 10https://gerrit.wikimedia.org/r/489214 (https://phabricator.wikimedia.org/T214720) (owner: 10Jcrespo) [14:27:44] (03PS2) 10Jcrespo: install_server: Allow full reimage of db1114 [puppet] - 10https://gerrit.wikimedia.org/r/489214 (https://phabricator.wikimedia.org/T214720) [14:34:49] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10MoritzMuehlenhoff) > The debian installer completes, but I can't log in because apparently the first puppet run isn't completed and I can't use any login me... 
[14:35:59] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudelastic1004.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudelastic1004.eqiad.wmnet'] ` [14:37:51] (03PS2) 10Jcrespo: mariadb: Switch db1114 and db1118 roles [puppet] - 10https://gerrit.wikimedia.org/r/489213 (https://phabricator.wikimedia.org/T214720) [14:43:01] (03PS1) 10GTirloni: wiki replicas: depool labsdb1009 for updates [puppet] - 10https://gerrit.wikimedia.org/r/489220 (https://phabricator.wikimedia.org/T212308) [14:43:31] (03CR) 10jerkins-bot: [V: 04-1] wiki replicas: depool labsdb1009 for updates [puppet] - 10https://gerrit.wikimedia.org/r/489220 (https://phabricator.wikimedia.org/T212308) (owner: 10GTirloni) [14:43:54] 10Operations, 10Wiki-Loves-Love, 10Wikimedia-Mailing-lists, 10User-jijiki: Reset password for wll mailling list - https://phabricator.wikimedia.org/T215390 (10jijiki) @Psychoslave I have sent you the new password, please let me know if you got an automated one as well. 
Let us know if everything is ok:) [14:44:28] (03PS2) 10GTirloni: wiki replicas: depool labsdb1009 for updates [puppet] - 10https://gerrit.wikimedia.org/r/489220 (https://phabricator.wikimedia.org/T212308) [14:47:41] (03CR) 10Marostegui: [C: 03+1] wiki replicas: depool labsdb1009 for updates [puppet] - 10https://gerrit.wikimedia.org/r/489220 (https://phabricator.wikimedia.org/T212308) (owner: 10GTirloni) [14:49:57] (03CR) 10GTirloni: [C: 03+2] wiki replicas: depool labsdb1009 for updates [puppet] - 10https://gerrit.wikimedia.org/r/489220 (https://phabricator.wikimedia.org/T212308) (owner: 10GTirloni) [14:51:06] !log Reload haproxy on dbproxy1011 to depool labsdb1009 - https://phabricator.wikimedia.org/T212308 [14:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:21] gtirloni: ^ [14:51:25] (03PS13) 10Jbond: Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) [14:52:18] thanks! [14:52:23] (03CR) 10jerkins-bot: [V: 04-1] Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: 10Jbond) [15:07:39] Hi, anybody to deploy https://gerrit.wikimedia.org/r/489207 for T215446? It's throttle rule for tomorrow [15:07:40] T215446: Request for temporary lift of account creation cap for Wikipedia edit-a-thon event (Feb 9) - https://phabricator.wikimedia.org/T215446 [15:08:37] MaxSem, twentyafterfour, RoanKattouw, dereckson, thcipriani, Niharika, zeljkof, Reedy ? 
[15:08:51] (03PS3) 10Jcrespo: mariadb: Switch db1114 and db1118 roles [puppet] - 10https://gerrit.wikimedia.org/r/489213 (https://phabricator.wikimedia.org/T214720) [15:10:55] !log Upgrading php-redis 4.1.1 to mwmaint1002 - T215376 [15:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:58] T215376: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 [15:11:08] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['db1118.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20190208151... [15:14:44] (03CR) 10Jcrespo: [C: 03+2] mariadb: Switch db1114 and db1118 roles [puppet] - 10https://gerrit.wikimedia.org/r/489213 (https://phabricator.wikimedia.org/T214720) (owner: 10Jcrespo) [15:18:49] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Analytics query access for search platform NLP contractor @Julia.glen - https://phabricator.wikimedia.org/T214623 (10RStallman-legalteam) Just confirming that the Master Services Agreement and Data Processing Agreem... [15:23:29] (03PS1) 10Andrew Bogott: nova: add wmcs-rescue-console.sh to compute hosts [puppet] - 10https://gerrit.wikimedia.org/r/489230 (https://phabricator.wikimedia.org/T215211) [15:28:38] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1118.eqiad.wmnet'] ` and were **ALL** successful. [15:30:58] (03PS8) 10Elukey: Introduce systemd::slice::all_users [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) [15:31:09] (03CR) 10Ottomata: [C: 03+1] "Cool!" 
[dns] - 10https://gerrit.wikimedia.org/r/489170 (https://phabricator.wikimedia.org/T212386) (owner: 10Elukey) [15:31:15] <_joe_> !log upgraded all php extensions to php 7.2 compatible versions on mwmaint1002 [15:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:46] !log imported debmonitor 0.1.5-1+deb10u1 to buster-wikimedia (T213527) [15:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:48] T213527: Prepare our base system layer for Debian buster - https://phabricator.wikimedia.org/T213527 [15:32:16] Urbanecm: sorry, just saw your ping [15:32:43] zeljkof, np, I'd like to get https://gerrit.wikimedia.org/r/489207 deployed extraordinarily [15:32:47] can you put it in a US SWAT, since it's throttle, it can be deployed without you, right? [15:32:50] (03PS14) 10Jbond: Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) [15:33:35] (03CR) 10Elukey: "> When dropping the code in Toolforge, there is leftover stale file" [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [15:33:36] <_joe_> !log apt-get upgrade on mwmaint2001 to fix the php installation T215376 [15:33:38] Which US SWaT?
[15:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:39] T215376: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 [15:33:43] zeljkof, [15:33:56] I don't see a US SWAT window between now and tomorrow [15:33:58] ah, it's friday :d [15:34:00] :D [15:34:20] sorry, totally confused [15:34:35] (03PS15) 10BBlack: Add a Google Translate-specific redirect-to-mobile [puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) (owner: 10Dr0ptp4kt) [15:34:42] np :D [15:35:17] PROBLEM - DPKG on mwmaint2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:35:19] Urbanecm: uh, so since it's Friday, I think we need permission from greg-g, I don't think I've ever deployed anything on Friday [15:35:25] even throttle rule :/ [15:35:42] I'm pretty sure somebody did so sometime [15:35:46] not 100% sure though [15:37:09] (03PS2) 10Andrew Bogott: nova: add wmcs-rescue-console.sh to compute hosts [puppet] - 10https://gerrit.wikimedia.org/r/489230 (https://phabricator.wikimedia.org/T215211) [15:37:30] (03PS3) 10Andrew Bogott: nova: add wmcs-rescue-console.sh to compute hosts [puppet] - 10https://gerrit.wikimedia.org/r/489230 (https://phabricator.wikimedia.org/T215211) [15:37:34] PROBLEM - HTTP-noc on mwmaint2001 is CRITICAL: connect to address 10.192.48.45 and port 80: Connection refused [15:38:48] RECOVERY - HTTP-noc on mwmaint2001 is OK: HTTP OK: HTTP/1.1 200 OK - 3516 bytes in 0.073 second response time [15:39:44] (03PS1) 10Muehlenhoff: Only enable backports up to stretch [puppet] - 10https://gerrit.wikimedia.org/r/489237 [15:39:55] (03CR) 10BBlack: [C: 03+1] "A few nits fixed (= vs ==, missing semicolon), and all the VTC tests now pass. Will sync up again today before merging."
[puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) (owner: 10Dr0ptp4kt) [15:40:36] (03CR) 10Dr0ptp4kt: "Thanks @bblack for the fixes!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) (owner: 10Dr0ptp4kt) [15:42:12] bblack: thx for the fixes. i'm available during the next 77 minutes if you want to sync [15:42:32] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10jijiki) @EvanProdromou I will try to pull you some st... [15:42:51] 10Operations, 10Icinga, 10monitoring: Icinga passive checks go awal and downtime stops working - https://phabricator.wikimedia.org/T196336 (10jcrespo) 05Resolved→03Open We believe we had a new case of this on the 2018-02-08 :-( [15:44:21] (03PS1) 10GTirloni: Revert "wiki replicas: depool labsdb1009 for updates" [puppet] - 10https://gerrit.wikimedia.org/r/489239 (https://phabricator.wikimedia.org/T212308) [15:44:58] (03CR) 10GTirloni: [C: 03+2] Revert "wiki replicas: depool labsdb1009 for updates" [puppet] - 10https://gerrit.wikimedia.org/r/489239 (https://phabricator.wikimedia.org/T212308) (owner: 10GTirloni) [15:46:02] !log Repool labsdb1009 - T212308 [15:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:06] T212308: Rerun maintain-views for all tables to drop valid_tag and tag_summary tables - https://phabricator.wikimedia.org/T212308 [15:46:52] dr0ptp4kt: ok yeah, mainly I just wanted you to be here so you can verify live functionality after the deploy, and/or help hold up the flameshield if everything melts [15:48:26] (03CR) 10Thcipriani: New throttle rule + removal of expired rules (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489207 
(https://phabricator.wikimedia.org/T215446) (owner: 10Urbanecm) [15:48:41] dr0ptp4kt: just say when and I'll start rebase -> merge -> deploy stuff [15:48:55] 10Operations, 10Wikimedia-General-or-Unknown, 10serviceops, 10PHP 7.2 support, 10User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (10Joe) `All the extensions were not upgraded at the time we did the 7.0 => 7.2 transition - my bad!... [15:49:01] bblack: i'm ready [15:50:23] dr0ptp4kt: ok, going! [15:50:39] (03PS16) 10BBlack: Add a Google Translate-specific redirect-to-mobile [puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) (owner: 10Dr0ptp4kt) [15:50:45] RECOVERY - DPKG on mwmaint2001 is OK: All packages OK [15:50:53] 10Operations, 10Patch-For-Review: Prepare our base system layer for Debian buster - https://phabricator.wikimedia.org/T213527 (10MoritzMuehlenhoff) Still some rough edges to sort out, but bare metal installations are working now: ` $ ssh graphite2002.codfw.wmnet Linux graphite2002 4.19.0-2-amd64 #1 SMP Debia... [15:51:24] (03CR) 10BBlack: [C: 03+2] Add a Google Translate-specific redirect-to-mobile [puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) (owner: 10Dr0ptp4kt) [15:53:17] (03CR) 10Arturo Borrero Gonzalez: "Good job! This is indeed a great idea!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/489230 (https://phabricator.wikimedia.org/T215211) (owner: 10Andrew Bogott) [15:56:01] PROBLEM - puppet last run on cp1077 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:56:07] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:56:23] PROBLEM - puppet last run on cp2023 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 48 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:56:50] bblack: ^ could that be your change? [15:56:53] PROBLEM - puppet last run on cp1089 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:57:07] it is [15:57:20] I don't think the change is actually faulty, though [15:57:36] it's the standard bullshit race condition on deploying a new file and referencing that new file all in the same puppet patch [15:57:47] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:57:51] also to boot, the cumin of "run-puppet-agent -q" claimed they were all successful :P [15:58:01] PROBLEM - puppet last run on cp1083 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:58:01] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:58:05] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:58:05] PROBLEM - puppet last run on cp4031 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:58:05] PROBLEM - puppet last run on cp4030 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:58:13] PROBLEM - puppet last run on cp2007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:58:15] PROBLEM - puppet last run on cp5007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:58:23] PROBLEM - puppet last run on cp2006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:58:24] ah so a second run should fix it? [15:58:41] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:58:41] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:58:41] PROBLEM - puppet last run on cp4029 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:58:47] PROBLEM - puppet last run on cp4032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:58:49] PROBLEM - puppet last run on cp1081 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:58:49] PROBLEM - puppet last run on cp1075 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:58:49] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:58:49] yeah, hopefully. I'm waiting a couple of minutes first, to let the puppet master/fileserver catch up to reality for sure [15:58:55] PROBLEM - puppet last run on cp1088 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:58:55] PROBLEM - puppet last run on cp2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:58:55] (03PS1) 10Anomie: wiki replicas: Remove reference to old comment fields [puppet] - 10https://gerrit.wikimedia.org/r/489242 (https://phabricator.wikimedia.org/T212972) [15:59:18] oh wait, maybe it's not the race condition heh! [15:59:33] PROBLEM - puppet last run on cp1085 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:59:39] maybe it's that I tested all the diffs manually, but the patch is critically missing the code change to actually deploy the new file :P [15:59:42] I tried a second run on cp1083 and failed yep [15:59:45] PROBLEM - puppet last run on cp4028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:59:45] PROBLEM - puppet last run on cp5008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:59:45] PROBLEM - puppet last run on cp5012 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:59:59] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [15:59:59] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:00:07] PROBLEM - puppet last run on cp1087 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:00:07] PROBLEM - puppet last run on cp5002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:00:19] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:00:19] PROBLEM - puppet last run on cp2019 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:00:23] PROBLEM - puppet last run on cp4027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:00:29] PROBLEM - puppet last run on cp1090 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:00:29] PROBLEM - puppet last run on cp2012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:00:29] PROBLEM - puppet last run on cp5009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:00:40] (03PS1) 10Elukey: role::analytics_test_cluster::coordinator: add basic camus support [puppet] - 10https://gerrit.wikimedia.org/r/489243 (https://phabricator.wikimedia.org/T212259) [16:00:49] PROBLEM - puppet last run on cp1079 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:00:53] PROBLEM - puppet last run on cp5011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:01:12] (03CR) 10jerkins-bot: [V: 04-1] role::analytics_test_cluster::coordinator: add basic camus support [puppet] - 10https://gerrit.wikimedia.org/r/489243 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [16:01:13] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:01:13] PROBLEM - puppet last run on cp1084 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:01:21] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:01:51] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:03:07] PROBLEM - puppet last run on cp1080 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:03:27] (03PS1) 10BBlack: Bugfix for prev commit 6c0cea96 [puppet] - 10https://gerrit.wikimedia.org/r/489244 (https://phabricator.wikimedia.org/T212197) [16:03:31] PROBLEM - puppet last run on cp4021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:03:54] (03PS2) 10BBlack: Bugfix for prev commit 22c6cd56 [puppet] - 10https://gerrit.wikimedia.org/r/489244 (https://phabricator.wikimedia.org/T212197) [16:03:58] (03CR) 10jerkins-bot: [V: 04-1] Bugfix for prev commit 22c6cd56 [puppet] - 10https://gerrit.wikimedia.org/r/489244 (https://phabricator.wikimedia.org/T212197) (owner: 10BBlack) [16:04:09] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:04:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10GTirloni) >>! In T209029#4937948, @aborrero wrote: > **WMCS needs discussion**: what do we want to do with this server? can it live with `spare::system` for... [16:04:26] (03CR) 10jerkins-bot: [V: 04-1] Bugfix for prev commit 22c6cd56 [puppet] - 10https://gerrit.wikimedia.org/r/489244 (https://phabricator.wikimedia.org/T212197) (owner: 10BBlack) [16:04:48] Line 1: Do not define bug in the header [16:04:55] PROBLEM - puppet last run on cp2022 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:05:03] we can't say the word bugfix in a commit title :) [16:05:11] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:05:12] bblack: oh for crying out loud, i'm sorry i didn't think of that [16:05:35] (03PS3) 10BBlack: Add file resource for translation-engine.inc.vcl [puppet] - 10https://gerrit.wikimedia.org/r/489244 (https://phabricator.wikimedia.org/T212197) [16:05:44] it's ok, it doesn't actually hurt anything, just spams the channel [16:05:59] I was about to ask, saw 68 criticals [16:06:06] no big issue I assume [16:06:13] they're just puppetfails to load new VCL, which leaves the existing VCL running [16:06:44] so they're critical in the sense of "someone needs to fix this pronto", but not in the sense of functional site issues [16:06:57] PROBLEM - puppet last run on cp2011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:06:57] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:06:57] (03CR) 10BBlack: [C: 03+2] Add file resource for translation-engine.inc.vcl [puppet] - 10https://gerrit.wikimedia.org/r/489244 (https://phabricator.wikimedia.org/T212197) (owner: 10BBlack) [16:06:58] yes, I got it [16:07:03] bblack: yeah. i'm having flashbacks to learning puppet [16:07:43] PROBLEM - puppet last run on cp5003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:07:43] well what's really missing here is true CI integration for our VCL stuff that would've caught this [16:08:04] but there's little point investing heavily in that direction at this point. We've been living this way for a few years and it will all go away eventually. [16:08:10] !log imported git-fat 0.1.3-2+deb10u1 to buster-wikimedia (T213527) [16:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:13] T213527: Prepare our base system layer for Debian buster - https://phabricator.wikimedia.org/T213527 [16:08:31] RECOVERY - puppet last run on cp1083 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:08:55] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:09:00] !log stopping s1 replication on dbstore1001 to speed up cloning T214720 [16:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:03] T214720: db1114 crashed - https://phabricator.wikimedia.org/T214720 [16:09:23] PROBLEM - puppet last run on cp5006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:10:47] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:10:47] RECOVERY - puppet last run on cp2019 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [16:10:57] RECOVERY - puppet last run on cp2012 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:11:01] PROBLEM - puppet last run on cp2014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:11:15] RECOVERY - puppet last run on cp1079 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:11:35] PROBLEM - puppet last run on cp4022 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 16 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:11:39] RECOVERY - puppet last run on cp1077 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:11:41] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [16:11:45] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [16:11:49] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [16:12:03] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:12:03] RECOVERY - puppet last run on cp2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:12:23] PROBLEM - puppet last run on cp1076 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:12:31] PROBLEM - puppet last run on cp2008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:12:43] RECOVERY - puppet last run on cp1089 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:13:37] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:13:49] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:13:55] RECOVERY - puppet last run on cp4031 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:13:55] RECOVERY - puppet last run on cp4030 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:13:59] RECOVERY - puppet last run on cp2007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:14:01] RECOVERY - puppet last run on cp5007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:14:11] RECOVERY - puppet last run on cp2006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:14:27] (03PS2) 10Elukey: role::analytics_test_cluster::coordinator: add basic camus support [puppet] - 10https://gerrit.wikimedia.org/r/489243 (https://phabricator.wikimedia.org/T212259) [16:14:29] RECOVERY - puppet last run on cp2010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:14:29] RECOVERY - puppet last run on cp4029 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:14:37] RECOVERY - puppet last run on cp4032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:14:39] RECOVERY - puppet last run on cp1081 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:14:39] RECOVERY - puppet last run on cp1075 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:14:39] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently 
enabled, last run 3 minutes ago with 0 failures [16:14:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Introduce systemd::slice::all_users [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [16:14:43] RECOVERY - puppet last run on cp2004 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:15:11] (03CR) 10Dr0ptp4kt: "Leaving a note for future self in case of similar changes." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) (owner: 10Dr0ptp4kt) [16:15:23] RECOVERY - puppet last run on cp1085 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [16:15:33] RECOVERY - puppet last run on cp4028 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [16:15:35] RECOVERY - puppet last run on cp5008 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:15:35] RECOVERY - puppet last run on cp5012 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:15:43] PROBLEM - puppet last run on cp2018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:15:45] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:15:45] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:15:53] RECOVERY - puppet last run on cp1087 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:16:07] RECOVERY - puppet last run on cp4027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:16:15] RECOVERY - puppet last run on cp5009 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [16:16:15] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file-frontend] [16:16:37] RECOVERY - puppet last run on cp5011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:18:22] (03PS1) 10Muehlenhoff: prometheus::node_exporter: Change OS detection for buster [puppet] - 10https://gerrit.wikimedia.org/r/489246 [16:18:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudelastic1004.wikimedia.org ` The log can be f... 
[16:18:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudelastic1004.wikimedia.org'] ` Of which those **FAILED**: ` ['cloudelastic1004.wikimedia.org'] ` [16:19:06] dr0ptp4kt: it should be live everywhere [16:19:23] arturo: o/ - if you have time I can merge now and let you run puppet to verify that nothing looks weird [16:19:29] well, I say that, but some puppetfails still sticky, one last run [16:19:32] bblack: thx. i see https://translate.google.com/translate?sl=auto&tl=id&u=https%3A%2F%2Fsimple.wikipedia.org%2Fwiki%2FCholera redirecting as expected and i see googlebot not getting redirected, so that's a start [16:19:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudelastic1004.wikimedia.org ` The log can be f... [16:20:14] elukey: ok [16:21:34] super [16:21:47] bblack: kafkacat doesn't even seem to be turning up hits for cp3039 - not sure if that's a symptom of where it is located in the topology or what [16:21:55] (03PS9) 10Elukey: Introduce systemd::slice::all_users [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) [16:22:11] dr0ptp4kt: 3039 is one the ones still failing puppet, it should catch up shortly [16:23:23] arturo: done! 
[16:25:03] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [16:25:13] elukey: :-( [16:25:15] https://www.irccloud.com/pastebin/aARqo1B7/ [16:25:19] RECOVERY - puppet last run on cp1088 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:05] RECOVERY - puppet last run on cp2022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:21] arturo: of course that is a class not a define! [16:26:27] RECOVERY - puppet last run on cp5002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:32] :-/ my fault elukey [16:26:47] RECOVERY - puppet last run on cp1090 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:26:53] no no I was sloppy, lemme fix it [16:26:54] sorry [16:27:05] ok [16:27:21] RECOVERY - puppet last run on cp4022 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:27:27] RECOVERY - puppet last run on cp1084 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:28:11] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:29:33] RECOVERY - puppet last run on cp1080 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:29:47] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [16:29:55] RECOVERY - puppet last run on cp4021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:30:21] dr0ptp4kt: where are you checking kafkacat at? 
[16:30:34] bblack: stat1007 [16:30:37] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:30:53] (03PS1) 10Elukey: profile::toolforge::bastion::resourcecontrol: fix class definition [puppet] - 10https://gerrit.wikimedia.org/r/489250 [16:31:37] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:31:55] (03CR) 10Elukey: [C: 03+2] profile::toolforge::bastion::resourcecontrol: fix class definition [puppet] - 10https://gerrit.wikimedia.org/r/489250 (owner: 10Elukey) [16:32:20] arturo: should be better now [16:33:17] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:34:13] RECOVERY - puppet last run on cp5003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:34:39] (03PS1) 10Alexandros Kosiaris: Add kubernetes pods PTR records for IPv4 [dns] - 10https://gerrit.wikimedia.org/r/489251 [16:34:51] (03CR) 10jerkins-bot: [V: 04-1] Add kubernetes pods PTR records for IPv4 [dns] - 10https://gerrit.wikimedia.org/r/489251 (owner: 10Alexandros Kosiaris) [16:35:21] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:35:24] elukey: indeed, thanks [16:35:29] https://www.irccloud.com/pastebin/QUQp3b8N/ [16:35:49] RECOVERY - puppet last run on cp5006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:37:14] RECOVERY - puppet last run on cp2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:38:06] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:38:18] RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:38:28] RECOVERY - puppet last run on cp1076 is OK: OK: Puppet is currently enabled, 
last run 1 minute ago with 0 failures [16:38:34] RECOVERY - puppet last run on cp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:41:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudelastic1004.wikimedia.org'] ` and were **ALL** successful. [16:41:52] RECOVERY - puppet last run on cp2018 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:42:14] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:42:57] 10Operations, 10Icinga, 10monitoring: Icinga passive checks go awal and downtime stops working - https://phabricator.wikimedia.org/T196336 (10Dzahn) a:05Dzahn→03None Was it really both "passive checks" and "downtime stopped working" at the same time or just one of them? [16:44:32] 10Operations, 10Core Platform Team, 10MediaWiki-Database, 10Wikimedia-Logstash, and 2 others: MediaWiki errors overloading logtash - https://phabricator.wikimedia.org/T215611 (10fsero) my 2 cents. i think what @CDanis proposes seems the right approach, alternatively we could ratelimit kafka output using so... [16:45:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10aborrero) 05Open→03Resolved a:03aborrero Thanks @Cmjohnson and @MoritzMuehlenhoff, the server seems fine now: ` aborrero@cloudelastic1004:~ $ sudo sm... 
[16:47:31] (03PS2) 10Alexandros Kosiaris: Add kubernetes pods PTR records for IPv4 [dns] - 10https://gerrit.wikimedia.org/r/489251 [16:47:43] (03CR) 10jerkins-bot: [V: 04-1] Add kubernetes pods PTR records for IPv4 [dns] - 10https://gerrit.wikimedia.org/r/489251 (owner: 10Alexandros Kosiaris) [16:49:27] (03CR) 10Fsero: [C: 04-1] "i think CI job did properly the job, fix the typo on codfw block pointing to eqiad and it might be better :)" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/489251 (owner: 10Alexandros Kosiaris) [16:49:56] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Analytics query access for search platform NLP contractor @Julia.glen - https://phabricator.wikimedia.org/T214623 (10Dzahn) >>! In T214623#4938377, @RStallman-legalteam wrote: > Just confirming that the Master Serv... [16:50:15] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Analytics query access for search platform NLP contractor @Julia.glen - https://phabricator.wikimedia.org/T214623 (10Dzahn) 05Stalled→03Open [16:50:43] (03PS3) 10Alexandros Kosiaris: Add kubernetes pods PTR records for IPv4 [dns] - 10https://gerrit.wikimedia.org/r/489251 [16:50:58] (03CR) 10jerkins-bot: [V: 04-1] Add kubernetes pods PTR records for IPv4 [dns] - 10https://gerrit.wikimedia.org/r/489251 (owner: 10Alexandros Kosiaris) [16:51:16] (03CR) 10Alexandros Kosiaris: Add kubernetes pods PTR records for IPv4 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/489251 (owner: 10Alexandros Kosiaris) [16:51:58] (03PS1) 10Muehlenhoff: Don't install pxz on buster [puppet] - 10https://gerrit.wikimedia.org/r/489252 [16:53:18] (03PS4) 10Alexandros Kosiaris: Add kubernetes pods PTR records for IPv4 [dns] - 10https://gerrit.wikimedia.org/r/489251 [16:53:31] (03CR) 10jerkins-bot: [V: 04-1] Add kubernetes pods PTR records for IPv4 [dns] - 10https://gerrit.wikimedia.org/r/489251 (owner: 10Alexandros Kosiaris) [16:53:55] 
10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Analytics query access for search platform NLP contractor @Julia.glen - https://phabricator.wikimedia.org/T214623 (10Gehel) >>! In T214623#4938690, @Dzahn wrote: > Thanks, will do. Did the date stay the same? > >>>... [16:54:58] (03CR) 10Cwhite: [C: 03+1] prometheus::node_exporter: Change OS detection for buster [puppet] - 10https://gerrit.wikimedia.org/r/489246 (owner: 10Muehlenhoff) [16:55:18] (03CR) 10Ppchelko: [C: 04-1] "LGTM, however -1 for now because we can only deploy this after next week MW train." [puppet] - 10https://gerrit.wikimedia.org/r/489211 (https://phabricator.wikimedia.org/T214706) (owner: 10Bmansurov) [16:55:26] (03PS2) 10Jcrespo: mariadb: Pool rc slaves with higher weight to rebalance load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489200 (https://phabricator.wikimedia.org/T214720) [16:55:28] (03PS1) 10Jcrespo: mariadb: Depool db1099:s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489253 [16:56:32] (03PS5) 10Alexandros Kosiaris: Add kubernetes pods PTR records for IPv4 [dns] - 10https://gerrit.wikimedia.org/r/489251 [16:56:39] (03PS4) 10Gehel: admins: create user with analytics-privatedata access for juliaglen [puppet] - 10https://gerrit.wikimedia.org/r/488120 (https://phabricator.wikimedia.org/T214623) (owner: 10Dzahn) [16:56:46] (03CR) 10jerkins-bot: [V: 04-1] Add kubernetes pods PTR records for IPv4 [dns] - 10https://gerrit.wikimedia.org/r/489251 (owner: 10Alexandros Kosiaris) [16:57:02] (03CR) 10jerkins-bot: [V: 04-1] admins: create user with analytics-privatedata access for juliaglen [puppet] - 10https://gerrit.wikimedia.org/r/488120 (https://phabricator.wikimedia.org/T214623) (owner: 10Dzahn) [17:03:35] bblack: things are in order but for one minor edge case - use of the "Desktop" link when using the translation proxy on simple english (that 301s and then the next request necessarily doesn't contain the toggle_view_desktip, 
which in turn makes it go back to mobile). that's a rather rare behavior from what i see, although i will doublecheck. anyway, it's something we'll want working for regular english, so i'll tend to that [17:03:57] s/desktip/desktop/ [17:10:56] (03CR) 10Dzahn: "> I'm quite a bit worried to see we are starting a "black list of bad words"" [puppet] - 10https://gerrit.wikimedia.org/r/487917 (https://phabricator.wikimedia.org/T201491) (owner: 10Dzahn) [17:11:02] PROBLEM - Disk space on labmon1001 is CRITICAL: DISK CRITICAL - free space: /srv 81974 MB (3% inode=93%) [17:11:39] (03PS1) 10Dzahn: Revert "add some common typo words to CI checks" [puppet] - 10https://gerrit.wikimedia.org/r/489262 [17:11:57] (03CR) 10jerkins-bot: [V: 04-1] Revert "add some common typo words to CI checks" [puppet] - 10https://gerrit.wikimedia.org/r/489262 (owner: 10Dzahn) [17:12:52] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.16; 2019-02-05), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) @aaron Any more ideas about what could be... [17:16:02] (03CR) 10Alexandros Kosiaris: [C: 04-1] Helm chart for eventgate-analytics deployment (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [17:17:11] (03CR) 10Alexandros Kosiaris: [C: 04-1] Helm chart for eventgate-analytics deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [17:21:23] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::node_exporter: Change OS detection for buster [puppet] - 10https://gerrit.wikimedia.org/r/489246 (owner: 10Muehlenhoff) [17:21:57] dr0ptp4kt: was that even working before the GT patch? 
[17:23:16] (03PS2) 10Dzahn: Revert "add some common typo words to CI checks" [puppet] - 10https://gerrit.wikimedia.org/r/489262 [17:23:33] (03CR) 10jerkins-bot: [V: 04-1] Revert "add some common typo words to CI checks" [puppet] - 10https://gerrit.wikimedia.org/r/489262 (owner: 10Dzahn) [17:24:19] (03PS5) 10Gehel: admins: create user with analytics-privatedata access for juliaglen [puppet] - 10https://gerrit.wikimedia.org/r/488120 (https://phabricator.wikimedia.org/T214623) (owner: 10Dzahn) [17:24:23] (03PS1) 10Arturo Borrero Gonzalez: src:prometheus-openstack-exporter: run wrap-and-sort [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/489264 [17:24:25] (03PS1) 10Arturo Borrero Gonzalez: src:prometheus-openstack-exporter: bump debhelper compat to 10 [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/489265 [17:24:27] (03PS1) 10Arturo Borrero Gonzalez: postinst: use non-existent home dir [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/489266 [17:24:29] (03PS1) 10Arturo Borrero Gonzalez: src:prometheus-openstack-exporter: bump std-versions to 3.9.8 [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/489267 [17:24:31] (03PS1) 10Arturo Borrero Gonzalez: src:prometheus-openstack-exporter: switch to python3 [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/489268 [17:24:33] (03PS1) 10Arturo Borrero Gonzalez: d/changelog: generate entry for 0.0.8-4 stretch-wikimedia [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/489269 (https://phabricator.wikimedia.org/T215605) [17:25:23] (03PS2) 10Ayounsi: Icinga: add ping check for ulsfo PDUs [puppet] - 10https://gerrit.wikimedia.org/r/489113 (https://phabricator.wikimedia.org/T209101) [17:25:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] src:prometheus-openstack-exporter: run wrap-and-sort [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/489264 (owner: 10Arturo Borrero Gonzalez) 
[17:25:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] src:prometheus-openstack-exporter: bump debhelper compat to 10 [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/489265 (owner: 10Arturo Borrero Gonzalez) [17:25:42] bblack: yeah, as i recall it did work (and following the code paths suggest it did), and for the other domains outside of simple it's still working [17:26:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] postinst: use non-existent home dir [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/489266 (owner: 10Arturo Borrero Gonzalez) [17:26:03] i can check the web logs, though. it's just such a rare behavior to go through the proxy and then go to the desktop link [17:26:09] and yet it defies user intent [17:26:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] src:prometheus-openstack-exporter: bump std-versions to 3.9.8 [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/489267 (owner: 10Arturo Borrero Gonzalez) [17:26:23] (03CR) 10Ayounsi: [C: 03+2] Icinga: add ping check for ulsfo PDUs [puppet] - 10https://gerrit.wikimedia.org/r/489113 (https://phabricator.wikimedia.org/T209101) (owner: 10Ayounsi) [17:26:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] src:prometheus-openstack-exporter: switch to python3 [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/489268 (owner: 10Arturo Borrero Gonzalez) [17:26:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/changelog: generate entry for 0.0.8-4 stretch-wikimedia [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/489269 (https://phabricator.wikimedia.org/T215605) (owner: 10Arturo Borrero Gonzalez) [17:27:05] !log merge Icinga: add ping check for ulsfo PDUs [17:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:35] (03PS2) 10Arturo Borrero Gonzalez: d/changelog: generate entry for 0.0.8-4 stretch-wikimedia [debs/prometheus-openstack-exporter] - 
10https://gerrit.wikimedia.org/r/489269 (https://phabricator.wikimedia.org/T215605) [17:28:09] (03CR) 10BBlack: [C: 03+1] "Seems sane. Obviously clients will need to actually support SRV lookups explicitly!" [dns] - 10https://gerrit.wikimedia.org/r/489170 (https://phabricator.wikimedia.org/T212386) (owner: 10Elukey) [17:30:20] (03CR) 10Dzahn: [C: 03+2] admins: create user with analytics-privatedata access for juliaglen [puppet] - 10https://gerrit.wikimedia.org/r/488120 (https://phabricator.wikimedia.org/T214623) (owner: 10Dzahn) [17:31:20] (03Abandoned) 10Jcrespo: mariadb: Depool db1099:s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489253 (owner: 10Jcrespo) [17:31:43] (03CR) 10Jcrespo: [C: 03+2] mariadb: Pool rc slaves with higher weight to rebalance load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489200 (https://phabricator.wikimedia.org/T214720) (owner: 10Jcrespo) [17:32:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/changelog: generate entry for 0.0.8-4 stretch-wikimedia [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/489269 (https://phabricator.wikimedia.org/T215605) (owner: 10Arturo Borrero Gonzalez) [17:32:56] (03Merged) 10jenkins-bot: mariadb: Pool rc slaves with higher weight to rebalance load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489200 (https://phabricator.wikimedia.org/T214720) (owner: 10Jcrespo) [17:34:48] (03PS6) 10Dzahn: admins: create user with analytics-privatedata access for juliaglen [puppet] - 10https://gerrit.wikimedia.org/r/488120 (https://phabricator.wikimedia.org/T214623) [17:34:58] 10Operations, 10Icinga, 10monitoring: Icinga passive checks go awal and downtime stops working - https://phabricator.wikimedia.org/T196336 (10jcrespo) The thing that we knew is "passive checks awol", then restart, then gone. We didn't test downtiming. 
[17:35:21] (03PS1) 10Papaul: DNS: Remove mgmt DNS for mw2213 [dns] - 10https://gerrit.wikimedia.org/r/489271 (https://phabricator.wikimedia.org/T203434) [17:37:33] (03CR) 10jenkins-bot: mariadb: Pool rc slaves with higher weight to rebalance load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489200 (https://phabricator.wikimedia.org/T214720) (owner: 10Jcrespo) [17:38:40] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (10jcrespo) Transfer of 1h and 20m, probably sped up because I stopped replication (avoiding to replay many changes). [17:40:37] (03PS2) 10Papaul: DNS: Remove mgmt DNS for mw2213 [dns] - 10https://gerrit.wikimedia.org/r/489271 (https://phabricator.wikimedia.org/T203434) [17:42:19] !log graceful reload of apache on phabricator prod server (phab1001) [17:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:45] PROBLEM - puppet last run on people1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members] [17:43:51] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[analytics-privatedata-users_ensure_members] [17:44:02] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for mw2213 [dns] - 10https://gerrit.wikimedia.org/r/489271 (https://phabricator.wikimedia.org/T203434) (owner: 10Papaul) [17:44:51] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decom mw2213 - https://phabricator.wikimedia.org/T203434 (10Papaul) [17:46:01] 10Operations, 10Mail, 10Phabricator, 10serviceops, and 2 others: Convert Phabricator mail config to use cluster.mailers - https://phabricator.wikimedia.org/T212989 (10greg) [17:46:05] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decom mw2213 - https://phabricator.wikimedia.org/T203434 (10Papaul) 05Open→03Resolved complete [17:46:19] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 1:" [dns] - 10https://gerrit.wikimedia.org/r/489170 (https://phabricator.wikimedia.org/T212386) (owner: 10Elukey) [17:47:43] !log phab1001 - restarting apache2 service for library upgrade [17:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:07] !log T215605 add prometheus-openstack-exporter 0.0.8-4 to stretch-wikimedia [17:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:10] T215605: cloudvps: missing packages in stretch for cloudcontrol servers - https://phabricator.wikimedia.org/T215605 [17:49:43] 10Operations, 10ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (10Papaul) Checked temperature in the rack all looks good. add blanks to the rack since we have only 8 servers in that rack. Leaving the task open for another week. [17:50:53] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[analytics-privatedata-users_ensure_members] [17:51:33] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[analytics-privatedata-users_ensure_members] [17:52:03] !log phab1001 - restarting phd service [17:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:31] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[analytics-privatedata-users_ensure_members] [17:53:50] these failures are a new race condition when adding new shell users [17:53:57] and fixes itself after the second puppet run [17:54:15] it tries to ensure all users are member of all users group..before a user has been created [17:54:29] (03PS1) 10Arturo Borrero Gonzalez: openstack: add keystone support for mitaka/stretch in cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/489275 (https://phabricator.wikimedia.org/T215407) [17:54:48] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Analytics query access for search platform NLP contractor @Julia.glen - https://phabricator.wikimedia.org/T214623 (10Nuria) Contract was signed so we should be good to go here. Dates are unchanged. [17:54:59] !log phab1001 - restart aphlict service [17:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:04] (03CR) 10jerkins-bot: [V: 04-1] openstack: add keystone support for mitaka/stretch in cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/489275 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [17:55:11] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Analytics query access for search platform NLP contractor @Julia.glen - https://phabricator.wikimedia.org/T214623 (10Nuria) @Julia.glen to confirm she has access and ticket can be closed. 
[17:57:39] PROBLEM - puppet last run on bast4002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members] [17:58:05] PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[analytics-privatedata-users_ensure_members] [17:58:15] you will see it recover on stat1004 /notebook1003 now cause i ran puppet [17:59:25] PROBLEM - puppet last run on an-master1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[analytics-privatedata-users_ensure_members] [18:00:04] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Analytics query access for search platform NLP contractor @Julia.glen - https://phabricator.wikimedia.org/T214623 (10Dzahn) 05Open→03Resolved puppet is creating her user on all the relevant servers right now. in... [18:00:09] mutante: re:icinga I don't expect there is any actionable, but I think it was important to note it on the ticket [18:00:35] jynus: ACK, thanks for adding that. makes sense [18:00:57] 10Operations, 10ops-codfw, 10decommission: Decommission baham - https://phabricator.wikimedia.org/T199247 (10Papaul) [18:02:07] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:02:55] RECOVERY - puppet last run on bast4002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:03:01] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[all-users_ensure_members] [18:03:07] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:03:45] does anyone know when we added the "ensure all users are in the special allusers group" thing [18:03:53] (03PS6) 10Alexandros Kosiaris: Add kubernetes pods PTR records for IPv4 [dns] - 10https://gerrit.wikimedia.org/r/489251 [18:03:55] (03PS1) 10Alexandros Kosiaris: Run zole validation on generated zonefiles [dns] - 10https://gerrit.wikimedia.org/r/489277 [18:04:30] it seems like we should have gotten this before .. lots of users were created [18:04:41] RECOVERY - puppet last run on an-master1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:06:39] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members] [18:08:52] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase weight for s1 rc slaves (duration: 00m 49s) [18:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:18] mutante: I added some users not a long time ago, I don't remember that being a thing [18:09:53] jynus: ok.. hmm. it's odd it seems consistent now, but it's just a race [18:09:54] (03CR) 10Thiemo Kreuz (WMDE): "> This is not about compiling a list of bad words." 
[puppet] - 10https://gerrit.wikimedia.org/r/487917 (https://phabricator.wikimedia.org/T201491) (owner: 10Dzahn) [18:10:07] RECOVERY - puppet last run on people1001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [18:10:15] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [18:10:19] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Revert "add some common typo words to CI checks" [puppet] - 10https://gerrit.wikimedia.org/r/489262 (owner: 10Dzahn) [18:10:21] there it goes [18:11:15] PROBLEM - puppet last run on bast5001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members] [18:11:29] PROBLEM - puppet last run on an-master1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[analytics-privatedata-users_ensure_members] [18:11:53] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:14:55] !log T213527 graphite2002 disabled puppet and commented prometheus_puppet_agent_stats cronjob due to cronspam [18:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:58] T213527: Prepare our base system layer for Debian buster - https://phabricator.wikimedia.org/T213527 [18:15:12] (03PS3) 10Dzahn: Revert "add some common typo words to CI checks" [puppet] - 10https://gerrit.wikimedia.org/r/489262 [18:18:56] (03CR) 10MarcoAurelio: "I find rather offensive the questioning of the morality of the (volunteer in my case) time that I and others have spent fixing those typos" [puppet] - 10https://gerrit.wikimedia.org/r/487917 (https://phabricator.wikimedia.org/T201491) (owner: 10Dzahn) [18:20:53] (03CR) 10Dzahn: "> But it is. 
The effect of this list is not like in a word processor that highlights possible mistakes, and let's the user decide. The eff" [puppet] - 10https://gerrit.wikimedia.org/r/487917 (https://phabricator.wikimedia.org/T201491) (owner: 10Dzahn) [18:21:38] (03PS4) 10Dzahn: Revert "add some common typo words to CI checks" [puppet] - 10https://gerrit.wikimedia.org/r/489262 [18:22:04] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:22:12] 10Operations, 10ops-eqiad, 10Patch-For-Review: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (10GTirloni) @RobH @Cmjohnson thanks a lot for this, really appreciate the effort. ` ------------------------------------------------------------ Server listening on TCP port 5001... [18:22:19] 10Operations, 10ops-eqiad, 10Patch-For-Review: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (10GTirloni) 05Open→03Resolved [18:22:43] (03PS1) 10Jcrespo: mariadb: Enable notifications for db1118 [puppet] - 10https://gerrit.wikimedia.org/r/489280 (https://phabricator.wikimedia.org/T214720) [18:22:56] (03CR) 10Dzahn: [C: 03+2] Revert "add some common typo words to CI checks" [puppet] - 10https://gerrit.wikimedia.org/r/489262 (owner: 10Dzahn) [18:23:01] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Enable notifications for db1118 [puppet] - 10https://gerrit.wikimedia.org/r/489280 (https://phabricator.wikimedia.org/T214720) (owner: 10Jcrespo) [18:23:58] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:26:26] (03PS1) 10Jcrespo: mariadb: Introduce and pool db1118 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489281 (https://phabricator.wikimedia.org/T214720) [18:28:11] (03PS1) 10Jcrespo: mariadb: Pool db1118 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489282 
(https://phabricator.wikimedia.org/T214720) [18:28:42] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [18:37:16] RECOVERY - puppet last run on an-master1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:40:14] (03CR) 10Paladox: "I don't see the need to have reverted this." [puppet] - 10https://gerrit.wikimedia.org/r/489262 (owner: 10Dzahn) [18:42:16] RECOVERY - puppet last run on bast5001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:49:07] (03PS1) 10BBlack: zone_validator: require -z argument zones dir [dns] - 10https://gerrit.wikimedia.org/r/489287 [18:49:09] (03PS1) 10BBlack: deploy-check: integrate other checks, no-gdnsd opt [dns] - 10https://gerrit.wikimedia.org/r/489288 [18:49:11] (03PS1) 10BBlack: update README and run-tests.sh [dns] - 10https://gerrit.wikimedia.org/r/489289 [18:49:28] (03CR) 10jerkins-bot: [V: 04-1] zone_validator: require -z argument zones dir [dns] - 10https://gerrit.wikimedia.org/r/489287 (owner: 10BBlack) [18:49:33] (03CR) 10jerkins-bot: [V: 04-1] deploy-check: integrate other checks, no-gdnsd opt [dns] - 10https://gerrit.wikimedia.org/r/489288 (owner: 10BBlack) [18:49:43] (03PS1) 10Bstorm: toolforge: Use a really old version of kubectl for the current k8s [puppet] - 10https://gerrit.wikimedia.org/r/489291 (https://phabricator.wikimedia.org/T215586) [18:49:53] (03PS1) 10BBlack: authdns-local-update: update deploy-check.py args [puppet] - 10https://gerrit.wikimedia.org/r/489292 [18:50:42] (03CR) 10jerkins-bot: [V: 04-1] toolforge: Use a really old version of kubectl for the current k8s [puppet] - 10https://gerrit.wikimedia.org/r/489291 (https://phabricator.wikimedia.org/T215586) (owner: 10Bstorm) [18:50:52] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (10jcrespo) Except for the above 3 patches, db1118 should be ready to go (not 
done so late in the week for obvious reasons). [18:52:45] (03PS2) 10Bstorm: toolforge: Use a really old version of kubectl for the current k8s [puppet] - 10https://gerrit.wikimedia.org/r/489291 (https://phabricator.wikimedia.org/T215586) [18:54:25] (03PS2) 10BBlack: zone_validator: require -z argument zones dir [dns] - 10https://gerrit.wikimedia.org/r/489287 [18:54:27] (03PS2) 10BBlack: deploy-check: integrate other checks, no-gdnsd opt [dns] - 10https://gerrit.wikimedia.org/r/489288 [18:54:29] (03PS2) 10BBlack: update README and run-tests.sh [dns] - 10https://gerrit.wikimedia.org/r/489289 [19:06:50] (03CR) 10Mobrovac: "Idem as Petr, LGTM, but we have to wait." [puppet] - 10https://gerrit.wikimedia.org/r/489211 (https://phabricator.wikimedia.org/T214706) (owner: 10Bmansurov) [19:09:39] (03CR) 10Mobrovac: "> Would a comment instead of a removal work then?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/488800 (owner: 10Alexandros Kosiaris) [19:15:03] (03PS1) 10Andrew Bogott: Cloud vms: enable a default tty [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) [19:15:36] (03CR) 10jerkins-bot: [V: 04-1] Cloud vms: enable a default tty [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) (owner: 10Andrew Bogott) [19:16:18] (03PS3) 10Volans: sre.hosts: add decommission cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/487982 (https://phabricator.wikimedia.org/T205886) [19:16:54] (03PS2) 10Andrew Bogott: Cloud vms: enable a default tty [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) [19:17:26] (03CR) 10BryanDavis: nova: add wmcs-rescue-console.sh to compute hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489230 (https://phabricator.wikimedia.org/T215211) (owner: 10Andrew Bogott) [19:19:55] 10Operations, 10Core Platform Team, 10MediaWiki-Database, 10Wikimedia-Logstash, and 2 others: MediaWiki errors overloading 
logstash - https://phabricator.wikimedia.org/T215611 (10Reedy) [19:24:26] (03PS4) 10Dzahn: mediawiki/scap: do not install sql scripts on canary appservers [puppet] - 10https://gerrit.wikimedia.org/r/479142 (https://phabricator.wikimedia.org/T211512) [19:30:39] 10Operations, 10Traffic, 10Core Platform Team Backlog (Designing), 10Patch-For-Review, and 5 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Anomie) A discussion about later plans for this task was accidentally started in code review. Copying... [19:32:57] 10Operations, 10Traffic, 10Core Platform Team Backlog (Designing), 10Patch-For-Review, and 5 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Anomie) >>! In [[https://gerrit.wikimedia.org/r/c/mediawiki/core/+/487544#message-3709f9fc817ed264ba9... [19:39:19] (03PS1) 10Krinkle: Set MW_NO_SESSION for various entry points in w/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489302 [19:39:26] (03CR) 10Dzahn: "@Effie i amended to a new version that simplifies and avoids creating a new profile. i now remember why i did that, i wanted to avoid havi" [puppet] - 10https://gerrit.wikimedia.org/r/479142 (https://phabricator.wikimedia.org/T211512) (owner: 10Dzahn) [19:39:37] (03CR) 10Volans: [C: 03+2] sre.hosts: add decommission cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/487982 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans) [19:40:18] (03CR) 10Krinkle: "E.g. 
https://performance.wikimedia.org/xenon/svgs/daily/2019-02-07.touch.svgz shows that "MediaWiki\Session\SessionManager::getSessionFrom" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489302 (owner: 10Krinkle) [19:40:49] (03CR) 10Krinkle: "https://performance.wikimedia.org/xenon/svgs/daily/2019-02-07.favicon.svgz" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489302 (owner: 10Krinkle) [19:41:45] (03Merged) 10jenkins-bot: sre.hosts: add decommission cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/487982 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans) [19:42:38] (03CR) 10Krinkle: [C: 03+2] Set MW_NO_SESSION for various entry points in w/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489302 (owner: 10Krinkle) [19:43:26] bblack: I've not yet read the whole backlog but the last ~24h of puppet run logs are always available in puppetboard [19:43:47] (03Merged) 10jenkins-bot: Set MW_NO_SESSION for various entry points in w/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489302 (owner: 10Krinkle) [19:43:54] (03CR) 10Anomie: [C: 03+1] Set MW_NO_SESSION for various entry points in w/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489302 (owner: 10Krinkle) [19:44:20] anomie: thx, just testing on mwdebug for now. 
[19:44:45] * Krinkle staging on mwdebug1002 [19:46:07] and they all still return 200 OK with the same response as without XWD, so I'll sync it out [19:47:47] !log krinkle@deploy1001 Synchronized w/extract2.php: Ia1e610a5f (duration: 00m 48s) [19:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:34] !log krinkle@deploy1001 Synchronized w/favicon.php: Ia1e610a5f (duration: 00m 46s) [19:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:06] 10Operations, 10Core Platform Team, 10MediaWiki-Database, 10Wikimedia-Logstash, and 2 others: MediaWiki errors overloading logstash - https://phabricator.wikimedia.org/T215611 (10Marostegui) [19:49:21] !log krinkle@deploy1001 Synchronized w/robots.php: Ia1e610a5f (duration: 00m 46s) [19:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:08] !log krinkle@deploy1001 Synchronized w/touch.php: Ia1e610a5f (duration: 00m 46s) [19:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:02] (03CR) 10Thiemo Kreuz (WMDE): "> I don't see the need to have reverted this." [puppet] - 10https://gerrit.wikimedia.org/r/489262 (owner: 10Dzahn) [19:53:45] (03CR) 10jenkins-bot: Set MW_NO_SESSION for various entry points in w/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489302 (owner: 10Krinkle) [19:57:36] (03CR) 10Krinkle: "To recap: We're continuing to use typos in the puppet repo, as before, for the purpose of common mistakes in normative code (i.e. statemen" [puppet] - 10https://gerrit.wikimedia.org/r/487917 (https://phabricator.wikimedia.org/T201491) (owner: 10Dzahn) [20:01:31] (03CR) 10Thiemo Kreuz (WMDE): "> […] questioning of the morality of the (volunteer in my case) time that I and others have spent fixing those typos […]" [puppet] - 10https://gerrit.wikimedia.org/r/487917 (https://phabricator.wikimedia.org/T201491) (owner: 10Dzahn) [20:03:34] (03CR) 10Krinkle: "@Thiemo It is not. 
You may be looking for grunt-tyops, https://github.com/wikimedia/grunt-tyops." [puppet] - 10https://gerrit.wikimedia.org/r/487917 (https://phabricator.wikimedia.org/T201491) (owner: 10Dzahn) [20:05:03] (03CR) 10Paladox: "> > I don't see the need to have reverted this." [puppet] - 10https://gerrit.wikimedia.org/r/489262 (owner: 10Dzahn) [20:08:49] (03CR) 10Cwhite: [C: 03+2] prometheus::node_exporter: Change OS detection for buster [puppet] - 10https://gerrit.wikimedia.org/r/489246 (owner: 10Muehlenhoff) [20:08:57] (03PS2) 10Cwhite: prometheus::node_exporter: Change OS detection for buster [puppet] - 10https://gerrit.wikimedia.org/r/489246 (owner: 10Muehlenhoff) [20:09:08] 10Operations, 10ExternalGuidance, 10Traffic, 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10dr0ptp4kt) Hi team. Okay, this is activated for Simple English as the source wiki. Thank you so much @BBlack! I'll prepare two patches n... [20:15:52] (03PS6) 10Zoranzoki21: Set wgRestrictionLevels for all Serbian projects to autoconfirmed, autopatrol and sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485903 [20:16:52] (03CR) 10Ottomata: Helm chart for eventgate-analytics deployment (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [20:17:01] 10Operations, 10monitoring, 10Goal, 10Patch-For-Review: Upgrade production prometheus-node-exporter to >= 0.16 - https://phabricator.wikimedia.org/T213708 (10colewhite) [20:17:51] (03CR) 10MarcoAurelio: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/487917 (https://phabricator.wikimedia.org/T201491) (owner: 10Dzahn) [20:18:31] (03PS4) 10Herron: lists:warn if unknown host issues mail from cmd containing our domain [puppet] - 10https://gerrit.wikimedia.org/r/488602 (https://phabricator.wikimedia.org/T215251) [20:21:07] (03CR) 10Herron: [C: 03+2] lists:warn if unknown host issues mail from cmd 
containing our domain [puppet] - 10https://gerrit.wikimedia.org/r/488602 (https://phabricator.wikimedia.org/T215251) (owner: 10Herron) [20:22:31] (03PS7) 10Zoranzoki21: Set wgRestrictionLevels for all Serbian projects to autoconfirmed, autopatrol, patroller, rollbacker and sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485903 (https://phabricator.wikimedia.org/T215653) [20:29:37] (03PS4) 10Ottomata: Refactor mysql::config::client to mariadb::config::client [puppet] - 10https://gerrit.wikimedia.org/r/482693 (https://phabricator.wikimedia.org/T162070) [20:31:15] (03CR) 10Ottomata: [C: 03+2] Refactor mysql::config::client to mariadb::config::client [puppet] - 10https://gerrit.wikimedia.org/r/482693 (https://phabricator.wikimedia.org/T162070) (owner: 10Ottomata) [20:32:32] 10Operations, 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, and 2 others: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10Ottomata) [20:44:16] (03PS1) 10Ottomata: Remove role::wikimetrics::staging [puppet] - 10https://gerrit.wikimedia.org/r/489314 (https://phabricator.wikimedia.org/T162070) [20:44:56] (03CR) 10Ottomata: [C: 03+2] Remove role::wikimetrics::staging [puppet] - 10https://gerrit.wikimedia.org/r/489314 (https://phabricator.wikimedia.org/T162070) (owner: 10Ottomata) [20:45:31] 10Operations, 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, and 2 others: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10Ottomata) [20:46:49] 10Operations, 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, and 2 others: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10Ottomata) a:05Ottomata→03Dzahn Thanks Daniel! The Analytics usages are gone. I'm ass... [20:49:40] (03CR) 10Ottomata: "Same, thanks yall!" 
[puppet] - https://gerrit.wikimedia.org/r/489211 (https://phabricator.wikimedia.org/T214706) (owner: Bmansurov) [21:13:41] ottomata: :) thanks! [21:16:55] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:17:39] PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 450.80 seconds [21:17:41] PROBLEM - MariaDB Slave Lag: m3 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 451.62 seconds [21:17:49] * gehel is looking at wdqs1005 [21:18:55] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational [21:19:12] sho thang mutante :) [21:19:47] (PS3) Reedy: Stop breaking blame for wikimedia special cases [mediawiki-config] - https://gerrit.wikimedia.org/r/473487 [21:19:51] jouncebot: now [21:19:52] No deployments scheduled for the next 61 hour(s) and 10 minute(s) [21:19:54] jouncebot: next [21:19:54] In 61 hour(s) and 10 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190211T1030) [21:22:58] (PS4) Reedy: Stop breaking blame for wikimedia special cases [mediawiki-config] - https://gerrit.wikimedia.org/r/473487 [21:23:08] (CR) Reedy: [C: +2] Stop breaking blame for wikimedia special cases [mediawiki-config] - https://gerrit.wikimedia.org/r/473487 (owner: Reedy) [21:23:53] RECOVERY - MariaDB Slave Lag: m3 on db2078 is OK: OK slave_sql_lag Replication lag: 27.68 seconds [21:23:55] RECOVERY - MariaDB Slave Lag: m3 on db2042 is OK: OK slave_sql_lag Replication lag: 21.57 seconds [21:23:59] Operations, monitoring, Patch-For-Review: EDAC events not being reported by node-exporter? - https://phabricator.wikimedia.org/T214529 (BBlack) >>! In T214529#4936555, @CDanis wrote: >>>> Corrected errors are normal and expected to occur on healthy >>>> hardware. They do not need user's attention unt... 
[21:24:14] (Merged) jenkins-bot: Stop breaking blame for wikimedia special cases [mediawiki-config] - https://gerrit.wikimedia.org/r/473487 (owner: Reedy) [21:24:57] (CR) jenkins-bot: Stop breaking blame for wikimedia special cases [mediawiki-config] - https://gerrit.wikimedia.org/r/473487 (owner: Reedy) [21:25:59] !log reedy@deploy1001 Synchronized multiversion/MWMultiVersion.php: Move variable (duration: 00m 49s) [21:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:21] (CR) Volans: "I'm not very familiar with our puppet tests, I can have a more deeper look next week. But CI seems to be happy :)" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: Jbond) [21:36:16] (PS15) Jbond: Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) [21:37:21] (CR) Jbond: "thanks" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: Jbond) [21:37:42] (PS16) Jbond: Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) [21:42:21] Puppet, Cloud-VPS, serviceops: upgrade simplelap / simpelap classes (apache -> httpd and mysql -> mariadb) or deprecate them - https://phabricator.wikimedia.org/T215662 (Dzahn) [21:44:08] Puppet, Cloud-VPS, serviceops: upgrade simplelap / simpelap classes (apache -> httpd and mysql -> mariadb) or deprecate them - https://phabricator.wikimedia.org/T215662 (Dzahn) [21:45:32] Puppet, Cloud-VPS, serviceops: upgrade simplelap / simpelap classes (apache -> httpd and mysql -> mariadb) or deprecate them - https://phabricator.wikimedia.org/T215662 (Dzahn) [21:45:49] RECOVERY - MariaDB Slave Lag: s4 on 
dbstore1002 is OK: OK slave_sql_lag Replication lag: 297.02 seconds [21:48:11] (CR) Andrew Bogott: [C: -1] "This patch gets the terminal up but the 'autologin root' bit does nothing at all. I've tried a bunch of variations of this patch to no av" [puppet] - https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) (owner: Andrew Bogott) [21:48:28] Puppet, Cloud-VPS, serviceops: upgrade simplelap / simplelamp classes (apache -> httpd and mysql -> mariadb) or deprecate them - https://phabricator.wikimedia.org/T215662 (Dzahn) [21:56:22] Puppet, Cloud-VPS, serviceops: upgrade simplelap / simplelamp classes (apache -> httpd and mysql -> mariadb) or deprecate them - https://phabricator.wikimedia.org/T215662 (Dzahn) [22:15:43] Puppet, Cloud-VPS, serviceops: upgrade simplelamp class (apache -> httpd and mysql -> mariadb) or deprecate it - https://phabricator.wikimedia.org/T215662 (Dzahn) [22:16:54] (PS1) Cwhite: prometheus: post-upgrade node-exporter cleanup [puppet] - https://gerrit.wikimedia.org/r/489325 (https://phabricator.wikimedia.org/T213708) [22:18:51] (PS1) Dzahn: convert simplelamp from apache to httpd module [puppet] - https://gerrit.wikimedia.org/r/489326 (https://phabricator.wikimedia.org/T215662) [22:22:56] (PS1) Dzahn: convert simplelamp from mysql to mariadb [puppet] - https://gerrit.wikimedia.org/r/489328 (https://phabricator.wikimedia.org/T215662) [22:26:33] (CR) Dzahn: "p" [puppet] - https://gerrit.wikimedia.org/r/489326 (https://phabricator.wikimedia.org/T215662) (owner: Dzahn) [22:27:35] Operations, Analytics, Analytics-Cluster, Analytics-Kanban, and 2 others: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Dzahn) [22:28:15] Operations, Analytics, Analytics-Cluster, Analytics-Kanban, and 2 others: Cleanup or remove mysql puppet module; repurpose mariadb module to cover 
misc use cases - https://phabricator.wikimedia.org/T162070 (Dzahn) [22:30:19] Operations, Analytics, Analytics-Cluster, Analytics-Kanban, and 2 others: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Dzahn) >>! In T162070#4939472, @Ottomata wrote: > Thanks Daniel! The Analytics usages are... [22:56:35] !log running `refreshImageMetadata.php --mediatype BITMAP --mime image/vnd.djvu` against commonswiki on mwmaint1002 T215635 [22:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:38] T215635: Run refreshImageMetadata.php for new media type of DjVu files - https://phabricator.wikimedia.org/T215635 [23:03:47] Reedy: Hah. [23:04:01] Reedy: I was planning to do that next week, but OK. [23:04:21] It took a lot of my time to do it :P [23:04:54] * James_F grins. [23:05:03] Also, doesn't that want a ForEachWiki? [23:05:42] Probably [23:05:52] But commons is going to have most of them presumably... [23:06:15] Indeed. And given that this is for the SDC search feature, it's rather lower value elsewhere. [23:06:26] Reedy: Do you know how to write an update.php call for this? [23:07:04] runMaintenance [23:07:08] Operations: Reset Wikitech 2FA access for MarkAHershberger - https://phabricator.wikimedia.org/T215676 (MarkAHershberger) [23:07:10] I don't know about passing parameters though [23:07:14] * James_F looks. [23:07:42] I guess DatabaseUpdater::$postDatabaseUpdateMaintenance technically [23:08:18] DatabaseUpdater::runMaintenance doesn't take options, yeah. [23:08:18] So it might be a simple wrapper function needed [23:09:20] Yeah. Also refreshImageMetadata doesn't extend LoggedUpdateMaintenance so… [23:09:53] Maybe I should make a RefreshImageMetadataForUpdate class that does it? 
[23:13:30] Operations, wikitech.wikimedia.org, cloud-services-team (Kanban): Reset Wikitech 2FA access for MarkAHershberger - https://phabricator.wikimedia.org/T215676 (bd808) Open→Resolved a:bd808 ` $ ssh bastion-eqiad1-01.bastion.eqiad.wmflabs # ls -alh /home/mah/Please-reset-wikitech-2fa.txt -rw-... [23:19:21] Finished refreshing file metadata for 106634 files. 0 were refreshed, 106634 were already up to date, and 0 refreshes were suspicious. [23:19:28] James_F: Think we need a force? [23:19:47] * Reedy looks at teh db [23:20:08] Reedy: Oh, maybe, yeah, "Reload metadata from file even if the metadata looks ok" [23:20:24] Puppet, Cloud-VPS, serviceops, Patch-For-Review: upgrade simplelamp class (apache -> httpd and mysql -> mariadb) or deprecate it - https://phabricator.wikimedia.org/T215662 (Slevinski) Thanks for the heads up. The SignWriting project is active and the two websites I administer are needed. I che... [23:20:26] Do we really only have 11k DjVu files on all of Commons? [23:20:44] 106K? [23:21:00] Oh, right. That's plausible. [23:21:02] heh [23:21:59] I guess if we think file metadata might change from run to run, we should just force-run it for all files on update.php? [23:22:09] wikiadmin@10.64.48.11(commonswiki)> select img_media_type from image where img_name = 'Capital_by_Marx,_Karl.djvu'\G [23:22:09] *************************** 1. 
row *************************** [23:22:09] img_media_type: BITMAP [23:22:09] 1 row in set (0.00 sec) [23:23:24] running `refreshImageMetadata.php --mediatype BITMAP --mime image/vnd.djvu --force` against commonswiki on mwmaint1002 T215635 (this time we mean it) [23:23:25] T215635: Run refreshImageMetadata.php for new media type of DjVu files - https://phabricator.wikimedia.org/T215635 [23:23:32] !log running `refreshImageMetadata.php --mediatype BITMAP --mime image/vnd.djvu --force` against commonswiki on mwmaint1002 T215635 (this time we mean it) [23:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:26] quite a bit slower this time [23:34:56] Well, you are doing DB writes this time around, so yeah. [23:47:09] Operations, Traffic, Core Platform Team Backlog (Designing), MW-1.33-notes (1.33.0-wmf.17; 2019-02-12), and 6 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (mobrovac) >>! In T201409#4939220, @Anomie wrote: > Defaulted to "never accep... [23:59:02] Puppet, Cloud-VPS, serviceops, Patch-For-Review: upgrade simplelamp class (apache -> httpd and mysql -> mariadb) or deprecate it - https://phabricator.wikimedia.org/T215662 (Dzahn) @Slevinski Not yet, i would warn you before i actually merge any changes. But be aware this role is currently broken...