[00:00:45] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [00:00:51] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:01:17] RECOVERY - puppet last run on webperf2002 is OK: OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures [00:01:29] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:01:57] RECOVERY - puppet last run on labsdb1012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:03:27] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:12:58] !log reedy@deploy1001 Synchronized php-1.33.0-wmf.21/includes/user/User.php: Replace wgUser with RequestContext::getUser in User::getBlockedStatus (duration: 00m 49s) [00:12:59] reedy@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [00:14:26] !log reedy@deploy1001 Synchronized php-1.33.0-wmf.21/tests/phpunit/includes/: Replace wgUser with RequestContext::getUser in User::getBlockedStatus (duration: 01m 00s) [00:14:27] reedy@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[00:15:53] !log test [00:20:26] (PS1) Alex Monk: acme_chief: Add security::access::config on passive host if realm == labs [puppet] - https://gerrit.wikimedia.org/r/497430 [00:21:43] (CR) jerkins-bot: [V: -1] acme_chief: Add security::access::config on passive host if realm == labs [puppet] - https://gerrit.wikimedia.org/r/497430 (owner: Alex Monk) [00:22:48] * Krenair pokes wikibugs [00:23:27] (PS2) Alex Monk: acme_chief: Add security::access::config on passive host if realm == labs [puppet] - https://gerrit.wikimedia.org/r/497430 [00:23:49] oh, just slow [00:33:45] (PS1) Alex Monk: acme_chief: Remove stretch support [puppet] - https://gerrit.wikimedia.org/r/497431 [00:37:48] (PS1) Ebe123: Partially revert "Enable musical notation datatype in wikidata" [mediawiki-config] - https://gerrit.wikimedia.org/r/497433 (https://phabricator.wikimedia.org/T216730) [00:38:25] (CR) Alex Monk: "(also, the package in wikimedia's stretch repo has not been kept up to date since prod moved to buster)" [puppet] - https://gerrit.wikimedia.org/r/497431 (owner: Alex Monk) [00:38:45] Operations, DBA, Gerrit, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (Paladox) p:Normal→High Per T200739#5034407 [00:39:06] (PS2) Ebe123: Partially revert "Enable musical notation datatype in wikidata" [mediawiki-config] - https://gerrit.wikimedia.org/r/497433 (https://phabricator.wikimedia.org/T216730) [00:40:34] (PS3) Ebe123: Partially revert "Enable musical notation datatype in wikidata" [mediawiki-config] - https://gerrit.wikimedia.org/r/497433 (https://phabricator.wikimedia.org/T216730) [00:40:48] Operations, ops-eqiad, DBA, decommission: Decommission parsercache hosts: pc1004.eqiad.wmnet pc1005.eqiad.wmnet pc1006.eqiad.wmnet - https://phabricator.wikimedia.org/T210969 (RobH) [00:59:31] (CR) Ottomata: [V: 
+2 C: +2] eventgate-analytics - Add statsd prometheus mappings for node-rdkafka-statsd metrics [deployment-charts] - https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305) (owner: Ottomata) [01:10:27] (PS1) Awight: Update DCAT-AP link [dumps/dcat] - https://gerrit.wikimedia.org/r/497435 [01:47:06] * Krinkle takes mwdebug1002 [01:47:12] Rolling out https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventLogging/+/497417/ ahead of wmf.22 to prevent UBN [01:58:43] (CR) Alex Monk: [C: -2] "needs reuploading against acme-chief.git too" [software/certcentral] - https://gerrit.wikimedia.org/r/485017 (https://phabricator.wikimedia.org/T213820) (owner: Vgutierrez) [01:59:33] I'm doing some live debugging on wikitech, if you see strange errors it's probably me [02:06:22] staging now [02:11:28] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.21/extensions/EventLogging/includes/RemoteSchema.php: If280a4056a (duration: 00m 51s) [02:11:28] krinkle@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [02:12:38] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.21/extensions/EventLogging/includes/ApiJsonSchema.php: If280a4056a (duration: 00m 48s) [02:12:38] krinkle@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[02:18:28] * Krinkle is done [02:56:04] (CR) Alex Monk: [C: +2] acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) (owner: Vgutierrez) [02:57:03] (PS1) Alex Monk: acme-chief: Ensure that the CN is part of the SNI list for certs config [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/497439 (https://phabricator.wikimedia.org/T218418) [02:57:15] (PS1) Alex Monk: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/497440 (https://phabricator.wikimedia.org/T207295) [03:21:35] (CR) Alex Monk: [C: +1] acme_chief: Update acme_chief::cert resource to fetch several cert versions [puppet] - https://gerrit.wikimedia.org/r/496148 (https://phabricator.wikimedia.org/T207295) (owner: Vgutierrez) [03:22:37] (PS4) Ebe123: Partially revert "Enable musical notation datatype in wikidata" [mediawiki-config] - https://gerrit.wikimedia.org/r/497433 (https://phabricator.wikimedia.org/T218535) [03:27:53] (PS13) Alex Monk: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) (owner: Vgutierrez) [03:28:28] (CR) Alex Monk: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) (owner: Vgutierrez) [03:28:32] (CR) Alex Monk: [C: +2] acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) (owner: Vgutierrez) [03:30:04] (Merged) jenkins-bot: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - https://gerrit.wikimedia.org/r/494957 
(https://phabricator.wikimedia.org/T207295) (owner: Vgutierrez) [03:36:29] (PS1) Alex Monk: Release 0.12 [software/acme-chief] - https://gerrit.wikimedia.org/r/497444 (https://phabricator.wikimedia.org/T218418) [03:36:41] (PS2) Alex Monk: Release 0.13 [software/acme-chief] - https://gerrit.wikimedia.org/r/497444 (https://phabricator.wikimedia.org/T218418) [03:40:57] Puppet, cloud-services-team, Patch-For-Review: Puppet failure emails sent to non-admin members of tools project causing user confusion - https://phabricator.wikimedia.org/T218009 (Krenair) Open→Resolved @stwalkerster this should be better now [03:43:19] (PS1) Alex Monk: Follow-up I71678b27: Remove stray MariaDB reference in openstack::clientpackages::newton::stretch [puppet] - https://gerrit.wikimedia.org/r/497445 (https://phabricator.wikimedia.org/T218009) [03:43:32] (CR) Alex Monk: [C: +1] "Also Ib4ac80ba1" [puppet] - https://gerrit.wikimedia.org/r/497210 (https://phabricator.wikimedia.org/T218009) (owner: BryanDavis) [03:43:55] (CR) jerkins-bot: [V: -1] Follow-up I71678b27: Remove stray MariaDB reference in openstack::clientpackages::newton::stretch [puppet] - https://gerrit.wikimedia.org/r/497445 (https://phabricator.wikimedia.org/T218009) (owner: Alex Monk) [03:45:38] !log Started manual run of unpublished ContentTranslation draft purge script (T218279) [03:45:40] kart_: Failed to log message to wiki. Somebody should check the error logs. [03:45:41] T218279: Run unpublished draft purge script for CX (Week of 03/17) - https://phabricator.wikimedia.org/T218279 [03:45:54] Starting few minutes earlier. [03:46:08] What happened to log? [03:46:52] But it seems logging on the task.. 
[03:48:06] tgr was looking into it [03:48:27] (PS3) Alex Monk: Re-apply "openstack::clientpackages::common: include python3 packages" [puppet] - https://gerrit.wikimedia.org/r/497009 (https://phabricator.wikimedia.org/T218423) [03:48:45] (CR) jerkins-bot: [V: -1] Re-apply "openstack::clientpackages::common: include python3 packages" [puppet] - https://gerrit.wikimedia.org/r/497009 (https://phabricator.wikimedia.org/T218423) (owner: Alex Monk) [03:48:47] OK. Thanks, Krenair [03:49:49] yeah, still dead [03:49:51] see T218608 [03:50:05] (PS4) Alex Monk: Re-apply "openstack::clientpackages::common: include python3 packages" [puppet] - https://gerrit.wikimedia.org/r/497009 (https://phabricator.wikimedia.org/T218423) [03:53:36] (CR) Alex Monk: "python3-designateclient in jessie depends on python3-cliff (>= 1.15.0). That has to be from jessie-backports as the normal version is 1" [puppet] - https://gerrit.wikimedia.org/r/497009 (https://phabricator.wikimedia.org/T218423) (owner: Alex Monk) [04:00:04] kart_: #bothumor My software never has bugs. It just develops random features. Rise for . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190319T0400). [04:00:19] yes, jouncebot. 
Started :) [05:25:47] (PS2) Mill: saaaaaaaaaaaaa [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/497440 (owner: Alex Monk) [05:25:49] (PS2) Mill: taaaaaaaaaaaaa [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/497439 (owner: Alex Monk) [05:26:17] (PS3) Mill: #aaaaaaaaaaaaa [software/acme-chief] - https://gerrit.wikimedia.org/r/497444 (owner: Alex Monk) [05:26:22] (PS2) Mill: 6aaaaaaaaaaaaa [dumps/dcat] - https://gerrit.wikimedia.org/r/497435 (owner: Awight) [05:27:26] (PS5) Mill: (aaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/497433 (owner: Ebe123) [05:27:34] (PS2) Mill: gbaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/497423 (owner: BryanDavis) [05:27:40] (PS3) Mill: jbaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/496857 (owner: Bmansurov) [05:27:54] (PS3) Mill: 9aaaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/497430 (owner: Alex Monk) [05:28:12] ^^ Gerrit vandalism starting up again [05:28:19] (PS5) Mill: scaaaaaaaaaaaa [cookbooks] - https://gerrit.wikimedia.org/r/496527 (owner: CRusnov) [05:28:30] (PS4) Mill: wcaaaaaaaaaaaa [cookbooks] - https://gerrit.wikimedia.org/r/497326 (owner: Gehel) [05:29:43] James_F: ^ [05:30:37] (PS5) Mill: iaaaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/497009 (owner: Alex Monk) [05:30:45] (PS4) Mill: (baaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/496991 (owner: BryanDavis) [05:30:59] (PS2) Mill: %26eaaaaaaaaaaaa [deployment-charts] - https://gerrit.wikimedia.org/r/497262 (owner: Tarrow) [05:31:54] (PS3) Mill: 1baaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/497318 (owner: Jbond) [05:31:58] (PS2) Mill: ueaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/497243 (owner: Gergő Tisza) [05:32:00] (PS2) Mill: yeaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/497236 
(owner: Ammarpad) [05:32:10] (PS7) Mill: 7baaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/496719 (owner: Muehlenhoff) [05:32:12] (PS15) Mill: acaaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/493774 (owner: CRusnov) [05:32:28] (PS2) Mill: 3caaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/497228 (owner: Marostegui) [05:32:30] (PS17) Mill: 6caaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/472713 (owner: Dzahn) [05:32:42] (PS3) Mill: lfaaaaaaaaaaaa [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/497120 (owner: Paladox) [05:33:10] (PS2) Mill: 8caaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/497321 (owner: Ppchelko) [05:34:04] Reedy: marostegui: moritzm: revi: ^ vandalism [05:34:11] Any Gerrit admins around? Mill needs to be blocked/disabled [05:34:50] Crap [05:34:50] I'm poking some people too [05:34:53] Not again [05:34:57] sigh [05:35:25] (PS16) Mill: qgaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/475772 (owner: Daimona Eaytoy) [05:35:32] Thanks bd808. Wasn't sure who to poke. [05:35:39] (PS2) Mill: odaaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/495921 (owner: Gehel) [05:35:53] (PS15) Mill: pgaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/477063 (owner: Daimona Eaytoy) [05:35:55] (PS2) Mill: ehaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/497081 (owner: Jforrester) [05:35:58] ... 
[05:36:17] (PS5) Mill: zhaaaaaaaaaaaa [software/netbox-reports] - https://gerrit.wikimedia.org/r/495267 (owner: CRusnov) [05:36:35] (PS4) Mill: !daaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/497293 (owner: Arturo Borrero Gonzalez) [05:36:46] (PS6) Mill: pdaaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/496858 (owner: Andrew Bogott) [05:36:48] (PS8) Mill: feaaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/496746 (owner: Jcrespo) [05:36:50] (PS2) Mill: meaaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/497273 (owner: Dzahn) [05:39:22] see -releng [05:42:48] (PS4) Mill: l$aaaaaaaaaaaa [puppet/nginx] - https://gerrit.wikimedia.org/r/492711 (owner: Mathew.onipe) [05:42:56] (PS9) Mill: k$aaaaaaaaaaaa [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/487793 (owner: Giuseppe Lavagetto) [05:43:05] (PS2) Mill: rnaaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/496130 (owner: GTirloni) [05:43:14] (PS2) Mill: e$aaaaaaaaaaaa [dns] - https://gerrit.wikimedia.org/r/485081 (owner: Ayounsi) [05:54:34] (PS6) Mill: !)aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/492676 (owner: Filippo Giunchedi) [05:55:52] (PS12) Mill: 2)aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/490073 (owner: Fsero) [05:55:59] (PS8) Mill: zpaaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/492518 (owner: Paladox) [05:56:01] (PS2) Mill: mqaaaaaaaaaaaa [software/spicerack] - https://gerrit.wikimedia.org/r/492307 (owner: Gehel) [05:56:11] (PS6) Mill: iqaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/491054 (owner: Ammarpad) [05:56:13] (PS14) Mill: jqaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/486221 (owner: Ammarpad) [05:56:19] (PS2) Mill: !qaaaaaaaaaaaa [deployment-charts] - https://gerrit.wikimedia.org/r/492269 (owner: Alexandros Kosiaris) [05:56:21] (PS2) Mill: lqaaaaaaaaaaaa 
[puppet] - https://gerrit.wikimedia.org/r/492314 (owner: Paladox) [05:56:25] (PS4) Mill: %5eqaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/455188 (owner: Aklapper) [05:57:16] (PS7) Mill: 2qaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/482548 (owner: Zoranzoki21) [05:57:18] (PS6) Mill: 3qaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/482534 (owner: Zoranzoki21) [05:57:20] (PS5) Mill: 4qaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/482502 (owner: Zoranzoki21) [05:57:26] (PS4) Mill: 5qaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/488895 (owner: DCausse) [05:57:30] (PS3) Mill: eraaaaaaaaaaaa [software/spicerack] - https://gerrit.wikimedia.org/r/491812 (owner: Gehel) [05:57:34] (PS3) Mill: graaaaaaaaaaaa [dns] - https://gerrit.wikimedia.org/r/483198 (owner: BBlack) [05:58:39] (PS5) Mill: bsaaaaaaaaaaaa [software/spicerack] - https://gerrit.wikimedia.org/r/491254 (owner: Gehel) [05:58:47] (PS3) Mill: esaaaaaaaaaaaa [software/conftool] - https://gerrit.wikimedia.org/r/359919 (owner: Giuseppe Lavagetto) [05:59:15] (PS5) Mill: %26saaaaaaaaaaaa [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/490299 (owner: ArielGlenn) [05:59:30] (PS6) Mill: 9raaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/482792 (owner: Giuseppe Lavagetto) [05:59:36] (PS2) Mill: jsaaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/491191 (owner: Mathew.onipe) [05:59:38] (PS2) Mill: %5esaaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/491007 (owner: Andrew Bogott) [05:59:48] (PS5) Mill: psaaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/485104 (owner: Dzahn) [06:00:02] (PS4) Mill: 9saaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/490496 (owner: EBernhardson) [06:00:04] (PS2) Mill: 1saaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/490834 
(owner: Filippo Giunchedi) [06:00:09] (PS3) Mill: itaaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/488530 (owner: Cwhite) [06:00:34] <_joe_> why? [06:00:47] <_joe_> sigh whatever [06:00:52] mm [06:00:53] ? [06:01:21] seems like this spammer isn't blocked yet https://gerrit.wikimedia.org/r/q/committer:mill%2540mail.com [06:01:39] is there a way to mass undo damage a gerrit spammer does? [06:01:45] yes [06:01:47] folks are on it [06:02:33] (PS2) Mill: 11aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/482793 (owner: Giuseppe Lavagetto) [06:02:34] that's nice [06:02:34] (PS3) Mill: b3aaaaaaaaaaaa [debs/php-excimer] (debian/stretch-wikimedia) - https://gerrit.wikimedia.org/r/481615 (owner: Hashar) [06:02:38] (PS2) Mill: %5e2aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/481157 (owner: Faidon Liambotis) [06:02:44] (PS2) Mill: 12aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/456317 (owner: Dzahn) [06:02:54] (PS2) Mill: n3aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/481540 (owner: Fomafix) [06:02:56] (PS2) Mill: r3aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/481303 (owner: ArielGlenn) [06:03:00] (PS19) Mill: e3aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/437052 (owner: Alex Monk) [06:03:10] (PS2) Mill: h4aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/480790 (owner: Herron) [06:03:44] (PS12) Mill: 25aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/456437 (owner: Paladox) [06:03:48] (PS6) Mill: o6aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/477503 (owner: Banyek) [06:04:23] (PS3) Mill: c7aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/455265 (owner: Aklapper) [06:04:25] (PS4) Mill: i7aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/376024 (owner: Giuseppe Lavagetto) [06:04:36] (PS3) Mill: #7aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/476005 (owner: Ema) 
[06:04:49] (PS2) Mill: w7aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/475500 (owner: Giuseppe Lavagetto) [06:04:51] (PS2) Mill: 59aaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/472157 (owner: WMDE-leszek) [06:04:55] (PS2) Mill: c8aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/474825 (owner: Thcipriani) [06:04:57] (PS5) Mill: 77aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/467317 (owner: Jcrespo) [06:05:03] (PS3) Mill: m8aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/456319 (owner: Dzahn) [06:05:05] (PS2) Mill: %268aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/474672 (owner: Muehlenhoff) [06:05:15] (PS12) Mill: %269aaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/278400 (owner: ArielGlenn) [06:05:33] (PS5) Mill: kcbaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/470814 (owner: BBlack) [06:05:35] (PS3) Mill: 3cbaaaaaaaaaaa [dns] - https://gerrit.wikimedia.org/r/467709 (owner: Volans) [06:05:41] (PS2) Mill: 0cbaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/469773 (owner: Faidon Liambotis) [06:05:43] (PS6) Mill: zcbaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/469791 (owner: Mobrovac) [06:08:05] i feel it's the same person each time. 
the last time i saw spam (it was on phab), it changed titles to things of the form "xyaaaaaaaaaa [06:08:07] " [06:08:13] (PS5) Mill: 0ibaaaaaaaaaaa [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/453666 (owner: Legoktm) [06:08:18] (PS4) Mill: xibaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/460451 (owner: EBernhardson) [06:08:22] (PS4) Mill: vibaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/273146 (owner: GWicke) [06:08:30] (PS2) Mill: njbaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/457409 (owner: Gehel) [06:08:32] (PS3) Mill: kjbaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/456439 (owner: Smalyshev) [06:08:34] (PS2) Mill: #jbaaaaaaaaaaa [puppet] - https://gerrit.wikimedia.org/r/457408 (owner: Gehel) [06:10:53] (CR) jerkins-bot: [V: -1] )gaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/486470 (owner: Daimona Eaytoy) [06:11:01] (CR) jerkins-bot: [V: -1] qgaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/475772 (owner: Daimona Eaytoy) [06:11:13] (CR) jerkins-bot: [V: -1] pgaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/477063 (owner: Daimona Eaytoy) [06:11:15] (CR) jerkins-bot: [V: -1] ehaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/497081 (owner: Jforrester) [06:11:30] (CR) jerkins-bot: [V: -1] jiaaaaaaaaaaaa [mediawiki-config] - https://gerrit.wikimedia.org/r/494016 (owner: Ammarpad) [06:19:35] euh... [06:22:21] thedj: g'morning! or maybe not as good [06:23:05] uh [06:23:09] that time of the year? [06:23:47] the guy is persistent... [06:23:53] Perhaps integration should be stopped too https://integration.wikimedia.org/zuul/ [06:25:36] has there been such a guy before? 
[06:25:39] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2052 is CRITICAL: cluster=mysql device=cciss,9 instance=db2052:9100 job=node site=codfw Marostegui T208323 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2052&var-datasource=codfw+prometheus/ops [06:26:28] razesoldier: some form of vandal pops up every year, sometimes a couple times. [06:26:47] this is slightly different, but the intent seems same [06:27:05] on the plus side, an extra incentive to review stale patches? 🤷 [06:30:03] anyone with op on -tech and/or -office? [06:30:09] need a quick kn [06:30:11] kb* [06:31:08] revi, yes [06:31:16] foks: see -ops-internal [06:31:21] eh.. this is interesting [06:31:22] 06:57 -!- Baylee [uid137715@wikimedia/TerraCodes] has joined #wikimedia-tech [06:31:25] 06:57 < Baylee> did the phab test install get nuked?, my user/pass isn't working [06:31:28] 06:57 -!- Guest97773 [2df50c6a@gateway/web/freenode/ip.45.245.12.106] has joined #wikimedia-tech [06:31:31] 07:05 < Baylee> nor is it saying theres an account for my emails when I try to reset my password [06:37:49] Who has the ability to ban Gerrit users then? Just curious [06:38:47] RelEng? [06:40:10] Hey addshore :P [06:40:52] Ohia [06:40:55] addshore: answer here https://wikitech.wikimedia.org/wiki/Gerrit#Disabling_/_Blocking_an_account [06:41:07] > This action is limited to users in the ldap/ops or capability-access-database groups. [06:41:57] I see [06:42:13] PROBLEM - SSH access on cobalt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit [06:42:47] marostegui: hooray :p [06:43:18] https://lists.wikimedia.org/pipermail/wikitech-l/2019-March/091792.html <- it's being worked on [06:44:11] awesome [06:46:21] i wonder why they'd go after just the commit messages [06:47:08] Hi, I'm trying to fetch repositories from gerrit but nothing works with me. any suggestions ? 
[06:47:17] shreyasminocha: i think i know why [06:47:27] iyeser: we are in incident response mode, please be patient [06:47:28] iyeser: gerrit is down for now [06:47:47] !log stop zuul and zuul-merger on contint1001 [06:47:48] akosiaris: Failed to log message to wiki. Somebody should check the error logs. [06:47:52] lol [06:47:56] forgot about that [06:48:06] hehe [06:48:16] thedj: why? [06:48:20] Akosiaris, All phab emails have gone through now. [06:48:46] RhinosF1: emm. ok ? [06:49:27] PROBLEM - zuul_gearman_service on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused [06:49:29] You were the one that told me they were down yesterday. [06:49:35] shreyasminocha: i don't thnk i should say in a public channel [06:50:03] PROBLEM - zuul_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [06:50:11] PROBLEM - zuul_merger_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [06:52:47] ACKNOWLEDGEMENT - zuul_gearman_service on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused Vgutierrez maintenance [06:52:48] ACKNOWLEDGEMENT - zuul_merger_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger Vgutierrez maintenance [06:52:48] ACKNOWLEDGEMENT - zuul_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server Vgutierrez maintenance [07:12:55] PROBLEM - puppet last run on webperf1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [07:20:04] !log twentyafterfour@deploy1001 Synchronized wmf-config/CommonSettings.php: Temporarily disable account creation on wikitech (duration: 00m 51s) [07:20:05] twentyafterfour@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [07:25:05] !log Finished manual run of unpublished ContentTranslation draft purge script (T218279) [07:25:07] kart_: Failed to log message to wiki. Somebody should check the error logs. [07:25:08] T218279: Run unpublished draft purge script for CX (Week of 03/17) - https://phabricator.wikimedia.org/T218279 [07:25:15] (This was late logging, sorry for that) [07:27:31] Kart_ Its not logged anyway, stashbot isn't working [07:27:55] Yep [07:28:21] RhinosF1: Just to add in the task atleast. [07:37:37] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:40:01] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:40:59] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.15.11-18-gd3ca89353d (SSHD-CORE-1.6.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit [07:44:07] RECOVERY - puppet last run on webperf1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:45:47] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:49:03] PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [07:50:21] PROBLEM - puppet last run on db1125 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [07:52:27] PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [07:53:21] PROBLEM - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [07:54:13] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki] [07:54:39] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas] [07:55:41] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 4 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki],Exec[git_pull_analytics/reportupdater] [07:57:37] PROBLEM - puppet last run on webperf2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [07:59:43] ^scheduled downtime for the "git pull" ones [07:59:56] Operations, Traffic, Wikidata, Wikidata-Campsite, and 2 others: nuke_limit often reached on esams varnish frontends - https://phabricator.wikimedia.org/T216006 (Addshore) Open→Resolved a:ema [08:03:03] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [08:03:03] PROBLEM - puppet last run on an-coord1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_operations/mediawiki-config] [08:03:23] PROBLEM - puppet last run on db2094 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [08:05:45] PROBLEM - puppet last run on eventlog1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [08:06:05] PROBLEM - puppet last run on sarin is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [08:07:05] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 6 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_wmde/scripts],Exec[git_pull_wmde/toolkit-analyzer-build],Exec[git_pull_mediawiki/event-schemas] [08:07:33] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts] [08:08:43] PROBLEM - puppet last run on db2095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [08:09:07] PROBLEM - puppet last run on labsdb1012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [08:10:53] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [08:11:55] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [08:11:59] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [08:11:59] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [08:12:15] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_All-Avatars] [08:12:35] PROBLEM - puppet last run on db1124 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [08:12:57] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [08:13:03] PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [08:13:27] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [08:13:34] silencing more to reduce noise [08:13:41] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_wikistats-v2],Exec[git_pull_analytics.wikimedia.org] [08:14:19] PROBLEM - puppet last run on cumin1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [08:17:43] addshore: I have a change to quickly ban users, but unfortunately that broke something so was reverted, though I figured out what was wrong so re did the change [08:17:55] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer] [08:17:55] PROBLEM - puppet last run on webperf1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [08:21:21] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [08:22:17] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 7 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer] [08:34:27] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 5 minutes ago with 6 failures. 
Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy] [08:41:33] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 5 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy] [08:57:55] PROBLEM - proton endpoints health on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [08:58:35] PROBLEM - Check size of conntrack table on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [08:58:39] PROBLEM - configured eth on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [08:58:43] PROBLEM - Disk space on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [08:58:51] PROBLEM - Check systemd state on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [08:59:15] PROBLEM - DPKG on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [08:59:23] PROBLEM - dhclient process on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [08:59:25] PROBLEM - Check whether ferm is active by checking the default input chain on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [09:02:33] PROBLEM - puppet last run on proton1002 is CRITICAL: connect to address 10.64.32.61 port 5666: Connection refused [09:02:40] * akosiaris looking [09:03:04] fork() failed with error 12, bailing out... [09:03:53] not enough memory to fork? 
[09:05:16] #define ENOMEM 12 /* Out of memory */ [09:05:17] yeah [09:05:41] could also be pid table exhaustion according to stackoverflow [09:07:01] https://grafana.wikimedia.org/d/000000555/host-overview-grafanalib?panelId=4&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-server=proton1002&var-cluster=proton&from=now-1h&to=now&refresh=300s [09:07:11] yeah it's the memory it went above the total very briefly [09:07:41] that's obviously impossible (node exporter something something) but at least it's clear [09:07:43] fixing [09:07:59] RECOVERY - dhclient process on proton1002 is OK: PROCS OK: 0 processes with command name dhclient [09:08:01] RECOVERY - Check whether ferm is active by checking the default input chain on proton1002 is OK: OK ferm input default policy is set [09:08:18] !log start nagios-nrpe-server on proton1002, failed due to fork() failed with error 12, bailing out... [09:08:19] akosiaris: Failed to log message to wiki. Somebody should check the error logs. 
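[editor's note] The exchange above maps the NRPE failure "fork() failed with error 12" to ENOMEM. A minimal sketch of the same lookup in Python follows, assuming a Linux host; the helper names describe_errno and likely_fork_failure_cause are hypothetical and not part of any tool mentioned in the channel. As far as the fork(2) manual page goes, ENOMEM points at memory pressure, while hitting a process or thread limit is usually reported as EAGAIN, so the "pid table exhaustion" theory would more typically show up as error 11 rather than 12.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME (code): message' for an OS error number."""
    # errno.errorcode maps numeric codes to their symbolic names.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({code}): {os.strerror(code)}"


def likely_fork_failure_cause(code: int) -> str:
    """Rough classification of a failed fork(), following fork(2).

    This is a hypothetical helper for illustration only.
    """
    if code == errno.ENOMEM:
        return "kernel could not allocate memory for the new process"
    if code == errno.EAGAIN:
        return "a process/thread limit was hit (RLIMIT_NPROC, pid_max, ...)"
    return "other cause, check the fork(2) manual page"


if __name__ == "__main__":
    # The code reported by nagios-nrpe-server on proton1002.
    print(describe_errno(12))
    print(likely_fork_failure_cause(12))
```

This only decodes the number; the Grafana memory panel linked in the conversation is what actually confirmed memory as the cause.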
[09:08:25] RECOVERY - Check size of conntrack table on proton1002 is OK: OK: nf_conntrack is 0 % full [09:08:29] RECOVERY - configured eth on proton1002 is OK: OK - interfaces up [09:08:33] RECOVERY - Disk space on proton1002 is OK: DISK OK [09:08:41] RECOVERY - Check systemd state on proton1002 is OK: OK - running: The system is fully operational [09:09:05] RECOVERY - DPKG on proton1002 is OK: All packages OK [09:09:07] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [09:12:57] RECOVERY - puppet last run on proton1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:33:57] 10Operations, 10Discovery-Search, 10Elasticsearch: cleanup the custom elasticsearch_${version}@ systemd unit in favor of an override configuration - https://phabricator.wikimedia.org/T218315 (10Gehel) actually, we're deploying a new unit as a template, so I'm not sure if we can just override the standard one... [09:42:14] jouncebot: now [09:42:14] No deployments scheduled for the next 1 hour(s) and 17 minute(s) [09:44:19] sorry if I missed that info, but is there any expectation when Gerrit will be back on? Thanks! [09:45:07] Urbanecm: I hope pretty soon but there is no definite ETA [09:45:17] thanks twentyafterfour [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190319T1100). [11:00:04] alaa_wmde: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:01:38] alaa_wmde: gerrit is down, so no swat [11:01:56] that ^ [11:28:38] @zeljkof yup thanks for the info [11:28:45] I'll move it to tomorrow for now [11:31:44] !log installing php5 security updates [11:31:45] moritzm: Failed to log message to wiki. Somebody should check the error logs. [11:32:00] no logging for now, manual updates later [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190319T1200) [13:00:04] zeljkof: #bothumor I � Unicode. All rise for MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190319T1300). [13:00:53] lol [13:08:08] gerrit is down, so no train for now [13:12:49] !log unfirewall gerrit, put service back in action [13:12:50] akosiaris: Failed to log message to wiki. Somebody should check the error logs. [13:13:47] RECOVERY - puppet last run on db1125 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [13:13:48] Thanks! 
[13:13:57] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [13:14:51] Note that some changes are still dirty, see https://gerrit.wikimedia.org/r/#/q/status:open [13:15:02] I see 8 [13:15:25] <_joe_> right [13:16:05] RECOVERY - puppet last run on webperf1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:17:01] RECOVERY - puppet last run on cumin2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:17:51] RECOVERY - puppet last run on webperf2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:18:08] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:18:23] Daimona: we will be cleaning those as well [13:18:32] but it's just those AFAICT by now [13:18:50] Sure, thanks [13:18:56] Just making sure you knew [13:19:51] !log start zuul/zuul-merger [13:19:51] akosiaris: Failed to log message to wiki. Somebody should check the error logs. 
[13:19:56] Should be, as that page is sorted chronologically by last change [13:19:59] we also need to fix stash bot [13:20:05] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:20:11] RECOVERY - zuul_service_running on contint1001 is OK: PROCS OK: 2 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [13:20:15] RECOVERY - zuul_merger_service_running on contint1001 is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [13:20:35] RECOVERY - zuul_gearman_service on contint1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 4730 [13:21:17] RECOVERY - puppet last run on webperf2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:22:30] (03PS1) 10Marostegui: db-eqiad.php: Depool hosts in row A [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497469 (https://phabricator.wikimedia.org/T187960) [13:23:12] whats wrong with stashbot? [13:23:41] Daimona: It's a lot less than before :P [13:23:56] Reedy: I imagine :) [13:24:03] (03CR) 10Zppix: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497466 (https://phabricator.wikimedia.org/T211700) (owner: 1020after4) [13:24:05] Is the amount of affected changes public? [13:24:32] (Or if it isn't, is there a "private" place where I can find it, for curiosity's sake?) 
[13:25:06] "a lot" [13:26:45] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:26:45] RECOVERY - puppet last run on an-coord1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:27:01] RECOVERY - puppet last run on db2094 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:27:28] (03PS1) 10Marostegui: db-eqiad.php: Set s2 on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497472 (https://phabricator.wikimedia.org/T187960) [13:28:19] (03CR) 10Marostegui: [C: 04-2] "Wait for the time to come" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497472 (https://phabricator.wikimedia.org/T187960) (owner: 10Marostegui) [13:29:09] Daimona: no, I don't think they are public. In fact we 've been very aggressive in removing them [13:29:15] RECOVERY - puppet last run on eventlog1002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [13:29:17] Wikimedia Platform Operations, serious stuff | Status: Up | Log: https://bit.ly/wikitech | Channel logs: https://bit.ly/opsirclog | Ops Clinic Duty:jijiki [13:29:20] gah [13:29:26] i am trying to look at stashbot logs.. 
toolforge is reallly slow [13:29:37] RECOVERY - puppet last run on sarin is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:29:58] but the pattern is just hurling insults, nothing really worth it [13:30:07] I see - that makes sense [13:30:57] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [13:31:05] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:31:23] (03CR) 10Jcrespo: [C: 03+1] db-eqiad.php: Set s2 on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497472 (https://phabricator.wikimedia.org/T187960) (owner: 10Marostegui) [13:31:57] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:32:21] RECOVERY - puppet last run on db2095 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:32:49] RECOVERY - puppet last run on labsdb1012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:33:08] marostegui: will we be going ahead? 
[13:34:19] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:35:01] addshore: Don't know, I am just preparing patches :) [13:35:14] ack :) [13:35:17] I guess i can do one too [13:35:19] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:35:25] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:35:27] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:35:27] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:35:41] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:36:03] RECOVERY - puppet last run on db1124 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [13:36:17] (03CR) 10Zppix: [C: 03+1] Temporarily disable account creation on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497466 (https://phabricator.wikimedia.org/T211700) (owner: 1020after4) [13:36:31] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:36:31] RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:37:05] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:37:13] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [13:37:41] marostegui: i'm in a meeting for the hour, but i can do the readonly ness that we need, we decided to just make the whole of wiktionaries read only for a few seconds right? 
[13:37:43] (03PS1) 10Alaa Sarhan: Set default wgScoreTrim to null [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497475 (https://phabricator.wikimedia.org/T218643) [13:37:55] RECOVERY - puppet last run on cumin1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:38:30] addshore: yep, that is it. I think we still have time for you to finish your meeting, it won't happen at 14:00 UTC exactly [13:38:50] marostegui: ack, that would be great [13:38:57] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:39:16] marostegui: looking at mediawiki-config, is readonly set via etcd now for wikis? [13:39:35] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [13:40:12] addshore: no, it is done like: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/497472/1/wmf-config/db-eqiad.php [13:40:52] (03PS3) 10Alex Monk: openstack: Follow-up I71678b27: Remove stray MariaDB reference [puppet] - 10https://gerrit.wikimedia.org/r/497445 (https://phabricator.wikimedia.org/T218009) [13:41:04] marostegui: okay, that is by section, i guess we need to do it per random dblist [13:41:17] addshore: you better than me :) [13:41:21] :P [13:41:21] you know [13:41:23] RECOVERY - puppet last run on webperf1002 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [13:41:58] Reedy: is that even a thing that is often done? / at all? [13:42:08] Is what? [13:42:18] Context in a busy channel with botspam [13:42:35] setting a single dblist of sites to be ready only, is there anything in place of should i just add an if at the bottom of commonsettings? 
[13:42:48] (03PS2) 10Alaa Sarhan: Set default wgScoreTrim to null [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497475 (https://phabricator.wikimedia.org/T218643) [13:43:24] addshore: The commented bits at bottom of db-eqiad.php look more sensible [13:43:28] # $wgLBFactoryConf['readOnlyBySection']['s2'] = [13:43:28] # 'Scheduled maintenance, s2 wikis in read-only mode for a few minutes'; [13:43:47] (03Abandoned) 10Alaa Sarhan: Set default wgScoreTrim to null [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497475 (https://phabricator.wikimedia.org/T218643) (owner: 10Alaa Sarhan) [13:43:51] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:43:59] (03CR) 10Alaa Sarhan: [C: 03+1] Partially revert "Enable musical notation datatype in wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497433 (https://phabricator.wikimedia.org/T218535) (owner: 10Ebe123) [13:44:16] Reedy: i guess i need to check if all wiktionaries are in s2 then *checks* [13:44:21] addshore like https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/497472/ ? [13:45:15] i would have to lock s3 s5 and s7 [13:45:19] addshore: all of them? 
I don't think all of them are on s2, some might, but not all [13:45:30] if nobody complains, I would like to cut the branch and start the train [13:45:34] (03CR) 10Reedy: db-eqiad.php: Set s2 on read only (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497472 (https://phabricator.wikimedia.org/T187960) (owner: 10Marostegui) [13:45:43] addshore: The downtime is expected to be a few seconds, so your call :) [13:45:59] Reedy: no, but not the patch to remove them [13:46:00] zeljkof: go for it, I can cleanup the bits we need to add for this week after [13:46:20] marostegui: yes, ack, I'll prepare the patch and make sure the syncing of it at the time is speedy :) [13:46:24] addshore: ok, starting the train then [13:46:29] * addshore goes back to meeting for now [13:46:50] Reedy: I thought the same :) [13:48:28] Reedy: the rationale is because we are going to revert that patch, hopefully in 1 minute [13:48:36] Sure [13:48:48] I wasn't suggesting to remove it in that patch, more just a general comment [13:49:01] Reedy: WHERE IS YOUR PATCH! :-P [13:49:52] (03PS1) 10Reedy: Remove dupe DB comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497477 [13:49:55] I HOPE YOU'RE HAPPY NOW [13:49:59] xD [13:50:06] marostegui: ^^^^ [13:50:12] A NON SPANISH PERSON USING xD [13:50:19] XDDDDDD [13:50:29] oh no, what have I done [13:50:37] lol [13:50:40] You just gained spanish citizenship, I think [13:50:45] "congrats" [13:50:47] I have been talking to marostegui too much recently clearly [13:50:54] hahahaha [13:52:56] I use xD alot [13:54:22] you mean xaxaxaxa [13:54:39] XDDDD [13:56:29] LOL [13:59:06] (03PS1) 10Jbond: debdeploy: add config to filter out services [puppet] - 10https://gerrit.wikimedia.org/r/497481 [14:00:08] 10Operations, 10netops: eqiad - eqord Telia link down - IC-314533 - https://phabricator.wikimedia.org/T218307 (10ayounsi) Opened a ticket with Equinix to check the X-connect. 
[14:04:25] (03CR) 10CDanis: Create and mtail parser for ulogd and install it on the syslog server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/496776 (https://phabricator.wikimedia.org/T215277) (owner: 10Jbond) [14:07:36] (03PS2) 10Jbond: debdeploy: add config to filter out services [puppet] - 10https://gerrit.wikimedia.org/r/497481 [14:08:12] (03CR) 10Jcrespo: [C: 03+1] "Looks good, only 1 question." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497469 (https://phabricator.wikimedia.org/T187960) (owner: 10Marostegui) [14:10:07] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [14:10:07] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [14:10:09] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [14:10:09] !log otto@deploy1001 scap-helm eventgate-analytics finished [14:10:09] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [14:10:10] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [14:10:53] cutting the branch [14:11:13] (03PS2) 10Marostegui: db-eqiad.php: Depool hosts in row A [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497469 (https://phabricator.wikimedia.org/T187960) [14:11:26] zeljkof: there is this on deploy1001: Your branch is ahead of 'origin/master' by 1 commit. [14:11:29] (03PS8) 10KartikMistry: Cron to run script to purge old CX drafts [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) [14:11:43] marostegui: uh oh [14:11:50] is there something I need to do? [14:12:00] zeljkof: can you take a look at it? 
[14:12:07] (03CR) 10Jcrespo: [C: 03+1] db-eqiad.php: Depool hosts in row A [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497469 (https://phabricator.wikimedia.org/T187960) (owner: 10Marostegui) [14:12:26] zeljkof: we need to deploy ^ for a network maintenance :) [14:12:40] (03CR) 10jerkins-bot: [V: 04-1] Cron to run script to purge old CX drafts [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [14:12:47] (03CR) 10KartikMistry: "> feel free to re-add me once it's out of WIP state and the manual" [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [14:13:00] marostegui: should I wait with cutting the branch? [14:13:08] I can take a look at the commit [14:13:09] (03CR) 10Muehlenhoff: "We'll need a more elaborate syntax here compared to the auto restarts, there's two things we want to express to ignore:" [puppet] - 10https://gerrit.wikimedia.org/r/497481 (owner: 10Jbond) [14:13:17] zeljkof: Sure, if you don't mind, that'd be great [14:13:34] marostegui: ok, waiting, let me know when I can cut the branch [14:13:36] We need to depool looots of hosts and we want to do it soon enough to catch any surprises if any, before the network maintenance [14:13:56] zeljkof: Actually, you have to let me know when it is fine to deploy my change :) [14:14:25] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:14:35] marostegui: well, I need 1-2 hours for the train today [14:14:39] so we didn't block the deployments page [14:14:41] if nothing explodes [14:14:45] but we may have to [14:14:58] ok, so I'm confused [14:15:01] because we didn't chose the time, dcops did [14:15:17] there's an extra commit somewhere at deploy1001? where? [14:15:28] should I cut the branch and start the train, or wait? 
[14:15:28] zeljkof: I think it is fine if you run the train for now, what I need is basically 1) if you can take a look at what's going on on deploy1001, and decide what to do with that commit, then once that is fixed, I can deploy my change and that's it :) [14:15:39] zeljkof: deploy1001:/srv/mediawiki-staging [14:15:47] ok, looking, let's start there [14:16:13] (03PS3) 10Jbond: debdeploy: add config to filter out services [puppet] - 10https://gerrit.wikimedia.org/r/497481 [14:16:30] zeljkof: thanks! once I have done my deploy I don't think we will go ahead with any other changes in the next 1h or so (and we can coordinate with you for those :) ) [14:16:53] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:17:46] marostegui: I see the extra commit, I guess I need to push it to gerrit, coordinating with -releng team [14:17:57] zeljkof: thank you! :) [14:18:21] (03CR) 10Jbond: "joe," [puppet] - 10https://gerrit.wikimedia.org/r/497481 (owner: 10Jbond) [14:20:36] * jijiki lunch, ping me if needing anything clinic duty related [14:21:33] (03CR) 10Muehlenhoff: [C: 03+1] Add LimitCORE support for uwsgi units. [puppet] - 10https://gerrit.wikimedia.org/r/493294 (owner: 10CRusnov) [14:22:29] (03CR) 10Hashar: "CI has been disabled this morning so it has no knowledge about the Code-Review +2. 
To be merged, this patch will need a Code-Review+2 pat" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497466 (https://phabricator.wikimedia.org/T211700) (owner: 1020after4) [14:23:41] (03PS1) 10Jbond: firewall-log: add defaults to common.yaml so WMCS can pick them up [puppet] - 10https://gerrit.wikimedia.org/r/497486 [14:24:34] (03CR) 10Jbond: [C: 03+2] firewall-log: add defaults to common.yaml so WMCS can pick them up [puppet] - 10https://gerrit.wikimedia.org/r/497486 (owner: 10Jbond) [14:25:52] (03PS3) 10Dzahn: set phab1002 as a spare::system [puppet] - 10https://gerrit.wikimedia.org/r/496116 (https://phabricator.wikimedia.org/T215332) [14:26:23] (03CR) 10Muehlenhoff: [C: 03+1] cross-validate-accounts: add a diff output to validate_common_ops_group [puppet] - 10https://gerrit.wikimedia.org/r/483520 (owner: 10Dzahn) [14:27:18] (03PS9) 10KartikMistry: Cron to run script to purge old CX drafts [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) [14:27:59] (03CR) 10Zfilipin: [C: 03+2] Temporarily disable account creation on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497466 (https://phabricator.wikimedia.org/T211700) (owner: 1020after4) [14:28:01] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [14:28:01] 10Operations, 10Services, 10VisualEditor, 10Readers-Web-Backlog (Tracking), 10Wikimedia-production-error: [Bug] Sporadic 503 errors when editing - https://phabricator.wikimedia.org/T218252 (10Pchelolo) Can this be resolved? [14:28:02] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [14:28:03] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [14:28:03] !log otto@deploy1001 scap-helm eventgate-analytics finished [14:28:04] otto@deploy1001: Failed to log message to wiki. 
Somebody should check the error logs. [14:28:04] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [14:28:44] ottomata: logging is currently broken due to https://phabricator.wikimedia.org/T218608 [14:29:06] ah ok, thanks [14:29:10] those are auto logged by scap-helm [14:29:10] (03Merged) 10jenkins-bot: Temporarily disable account creation on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497466 (https://phabricator.wikimedia.org/T211700) (owner: 1020after4) [14:29:16] (03CR) 10Jcrespo: [C: 03+1] db-eqiad.php: Promote db1120 as x1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496724 (https://phabricator.wikimedia.org/T187960) (owner: 10Marostegui) [14:29:31] ottomata: *node* ..realized a moment later. just fyi [14:29:35] nod [14:29:36] win 5 [14:29:42] how does one node mutante haha [14:31:13] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [14:31:14] (03PS4) 10Jbond: debdeploy: add config to filter out services [puppet] - 10https://gerrit.wikimedia.org/r/497481 [14:31:48] !log otto@deploy1001 scap-helm eventgate-analytics install -n production -f eventgate-analytics-codfw-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: codfw] [14:31:49] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [14:31:57] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics-codfw-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: codfw] [14:31:57] otto@deploy1001: Failed to log message to wiki. 
Somebody should check the error logs. [14:32:00] !log otto@deploy1001 scap-helm eventgate-analytics cluster codfw completed [14:32:00] !log otto@deploy1001 scap-helm eventgate-analytics finished [14:32:00] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [14:32:00] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [14:32:30] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 5 others: Read from key-per-property cache - https://phabricator.wikimedia.org/T218124 (10alaa_wmde) a:03alaa_wmde [14:32:35] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics-eqiad-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad] [14:32:36] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [14:32:36] !log otto@deploy1001 scap-helm eventgate-analytics cluster eqiad completed [14:32:36] !log otto@deploy1001 scap-helm eventgate-analytics finished [14:32:37] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [14:32:37] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[14:35:53] (03PS9) 10Jcrespo: mariadb-snapshots: Better error and logging handling [puppet] - 10https://gerrit.wikimedia.org/r/496746 (https://phabricator.wikimedia.org/T210292) [14:35:55] (03PS2) 10Jcrespo: mariadb: Promote db1120 as x1 master [puppet] - 10https://gerrit.wikimedia.org/r/496723 (https://phabricator.wikimedia.org/T187960) (owner: 10Marostegui) [14:35:58] (03PS1) 10Jcrespo: mariadb-snapshots: Run transfer as root, not dump [puppet] - 10https://gerrit.wikimedia.org/r/497491 (https://phabricator.wikimedia.org/T206203) [14:36:25] (03PS3) 10Jcrespo: mariadb: Promote db1120 as x1 master [puppet] - 10https://gerrit.wikimedia.org/r/496723 (https://phabricator.wikimedia.org/T187960) (owner: 10Marostegui) [14:37:46] (03CR) 10Jcrespo: [C: 03+1] mariadb: Failover db1066 to db1076 on s2 [puppet] - 10https://gerrit.wikimedia.org/r/496720 (https://phabricator.wikimedia.org/T187960) (owner: 10Marostegui) [14:38:04] (03PS5) 10Dzahn: cross-validate-accounts: add a diff output to validate_common_ops_group [puppet] - 10https://gerrit.wikimedia.org/r/483520 [14:38:54] (03CR) 10Jcrespo: [C: 03+1] db-eqiad.php: Failover db1066 to db1076 on s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496721 (https://phabricator.wikimedia.org/T187960) (owner: 10Marostegui) [14:38:59] 10Operations, 10Services, 10VisualEditor, 10Readers-Web-Backlog (Tracking), 10Wikimedia-production-error: [Bug] Sporadic 503 errors when editing - https://phabricator.wikimedia.org/T218252 (10Niedzielski) I can no longer reproduce the issue reported so closing works for me. However, I'm unsure if there's... 
[14:39:35] Zppix: modules/service :o [14:39:55] (03PS1) 10Ottomata: Change eventgate-analytics rdkafka statistics.interval.ms to 30s [deployment-charts] - 10https://gerrit.wikimedia.org/r/497492 [14:40:09] (03CR) 10Gehel: [C: 04-1] "see comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [14:40:13] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Change eventgate-analytics rdkafka statistics.interval.ms to 30s [deployment-charts] - 10https://gerrit.wikimedia.org/r/497492 (owner: 10Ottomata) [14:40:36] (03CR) 10Dzahn: [C: 03+2] cross-validate-accounts: add a diff output to validate_common_ops_group [puppet] - 10https://gerrit.wikimedia.org/r/483520 (owner: 10Dzahn) [14:41:13] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Reapply I49a18d8b36a0adfd9fcf9ef0d6c6f1bbc7e7bc68 from gerrit for consistency (duration: 00m 49s) [14:41:13] reedy@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[14:42:07] (03CR) 10Gehel: [C: 03+2] elasticsearch: multiple commands should be unpacked [cookbooks] - 10https://gerrit.wikimedia.org/r/497326 (https://phabricator.wikimedia.org/T218116) (owner: 10Gehel) [14:42:28] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool hosts in row A [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497469 (https://phabricator.wikimedia.org/T187960) (owner: 10Marostegui) [14:43:29] (03Merged) 10jenkins-bot: db-eqiad.php: Depool hosts in row A [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497469 (https://phabricator.wikimedia.org/T187960) (owner: 10Marostegui) [14:43:38] (03PS3) 10Gehel: elasticsearch: deploy elasticsearch config for ES6 [puppet] - 10https://gerrit.wikimedia.org/r/495921 (https://phabricator.wikimedia.org/T218116) [14:44:42] marostegui: please let me know when you're done, so I can cut the branch [14:44:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool databases in row A - T187960 (duration: 00m 48s) [14:44:52] zeljkof: done! :) [14:44:53] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [14:44:53] T187960: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 [14:45:07] marostegui: thanks! cutting the branch then [14:45:20] zeljkof: cool, will coordinate with you for further pushes [14:46:12] marostegui: ok [14:46:37] !log rebooting icinga1001 for kernel update [14:46:37] moritzm: Failed to log message to wiki. Somebody should check the error logs. [14:49:07] zeljkof: T91454 looks deployed to me. [14:49:08] T91454: ApiJsonSchema implements ApiBase::getCustomPrinter for no good reason - https://phabricator.wikimedia.org/T91454 [14:49:44] zeljkof: At least, it's in the mediawiki-staging/ copy of wmf.21. [14:49:59] Did you deploy it for K.rinkle or did he? 
[14:50:14] James_F: ah, I was looking at SAL, didn't see it there, assumed it's not deployed, but I guess there's something wrong with SAL [14:50:30] zeljkof: Oh, yes. SAL is totally inoperable since yesterday. [14:50:31] didn't check the filesystem [14:50:38] zeljkof: That's what "07:46:38 <+stashbot> moritzm: Failed to log message to wiki. Somebody should check the error logs." [14:50:41] means [14:50:41] James_F: thanks, makes sense now [14:50:43] Bah. [14:50:44] zeljkof: logging to wiki is broken, but https://tools.wmflabs.org/sal works [14:50:57] I've noticed those recently, just connected the dots [14:50:59] cdanis: Thanks! [14:51:33] 02:12 Synchronized php-1.33.0-wmf.21/extensions/EventLogging/includes/ApiJsonSchema.php: If280a4056a (duration: 00m 48s) [14:51:33] 02:11 Synchronized php-1.33.0-wmf.21/extensions/EventLogging/includes/RemoteSchema.php: If280a4056a (duration: 00m 51s) are the relevant lines. [14:51:38] (03PS1) 10Jbond: Add option to filter out services which dont actully need a restart [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/497496 [14:52:51] (03CR) 10Jbond: "this would deprecate https://gerrit.wikimedia.org/r/c/operations/debs/debdeploy/+/493697" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/497496 (owner: 10Jbond) [14:54:16] James_F: everything all right? [14:54:57] Krinkle: Yes, zeljkof was just worried that T91454 didn't seem resolved and was worried about proceeding with the train. Everything's fine. [14:54:58] T91454: ApiJsonSchema implements ApiBase::getCustomPrinter for no good reason - https://phabricator.wikimedia.org/T91454 [14:55:04] k [14:55:20] Krinkle: sorry, I got confused with SAL not working :( [14:55:46] Is someone on fixing SAL? (and who would that be?)
[14:56:09] Krinkle: Look at Gergo's WIP patches [14:56:12] It's not quite so simple [14:56:12] Krinkle: https://phabricator.wikimedia.org/T218608 [14:56:32] and workaround: https://tools.wmflabs.org/sal [14:57:39] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/497481 (owner: 10Jbond) [14:57:46] ah, right. it's only the wiki part, and stashbot won't give up or do wikitech first. [14:57:48] perfect [14:57:51] jynus: thanks [14:58:12] Greg send the tip after I complained, thank him :-) [14:58:22] *sent [14:59:51] (03PS1) 10Pmiazga: Enable Advanced Mobile Contributions mode for ar,id,es and test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497500 [15:01:05] (03PS2) 10Jbond: Add option to filter out services which dont actully need a restart [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/497496 [15:01:15] (03CR) 10jerkins-bot: [V: 04-1] Enable Advanced Mobile Contributions mode for ar,id,es and test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497500 (owner: 10Pmiazga) [15:02:46] (03PS2) 10Marostegui: db-eqiad.php: Set s2 on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497472 (https://phabricator.wikimedia.org/T187960) [15:06:36] (03PS4) 10Andrew Bogott: toolforge: Install qstat-full in /usr/local/bin [puppet] - 10https://gerrit.wikimedia.org/r/497216 (https://phabricator.wikimedia.org/T218504) (owner: 10BryanDavis) [15:06:55] (03PS1) 10Mathew.onipe: elasticsearch: remove from systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/497503 (https://phabricator.wikimedia.org/T218315) [15:07:37] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: remove from systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/497503 (https://phabricator.wikimedia.org/T218315) (owner: 10Mathew.onipe) [15:08:58] (03PS5) 10Andrew Bogott: toolforge: Install qstat-full in /usr/local/bin [puppet] - 10https://gerrit.wikimedia.org/r/497216 (https://phabricator.wikimedia.org/T218504) (owner: 10BryanDavis) 
[15:10:24] (03PS1) 10Vgutierrez: acme-chief: Use relative symlinks for the live one [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497504 (https://phabricator.wikimedia.org/T218685) [15:11:10] (03PS36) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [15:11:14] (03PS2) 10Pmiazga: Enable Advanced Mobile Contributions mode for ar,id,es and test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497500 [15:11:16] (03Abandoned) 10DCausse: [cirrus] Start using local nginx reverse proxy for connections reuse [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488895 (https://phabricator.wikimedia.org/T215491) (owner: 10DCausse) [15:11:19] (03PS2) 10Mathew.onipe: elasticsearch: remove from systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/497503 (https://phabricator.wikimedia.org/T218315) [15:11:21] (03CR) 10Andrew Bogott: [C: 03+2] toolforge: Install qstat-full in /usr/local/bin [puppet] - 10https://gerrit.wikimedia.org/r/497216 (https://phabricator.wikimedia.org/T218504) (owner: 10BryanDavis) [15:12:24] !log eqiad A7 servers uplink move - T187960 [15:12:26] XioNoX: Failed to log message to wiki. Somebody should check the error logs. [15:12:27] T187960: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 [15:15:22] (03CR) 10Andrew Bogott: "Does this disable emails by default on all projects, except where overridden by hiera? Or am I reading this backwards?" [puppet] - 10https://gerrit.wikimedia.org/r/495670 (https://phabricator.wikimedia.org/T218009) (owner: 10GTirloni) [15:16:13] zeljkof: can you hold the train for 15 minutes? [15:16:24] so we can do the network maintenance on the s2 database master? 
[15:16:27] 10Operations, 10ops-eqiad, 10Operations-Software-Development, 10monitoring: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10CDanis) @fgiunchedi we should set this device to 0 weight in the rings, yes? Happy to do the change if you'll review [15:17:38] (03CR) 10Marostegui: db-eqiad.php: Set s2 on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497472 (https://phabricator.wikimedia.org/T187960) (owner: 10Marostegui) [15:17:59] marostegui: sure, I didn't even start yet, technical problems :/ [15:18:05] zeljkof: thanks! [15:18:09] will let you know when done [15:18:10] marostegui: please let me know when you're done so I can cut the branch [15:18:13] thanks! [15:18:17] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Set s2 on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497472 (https://phabricator.wikimedia.org/T187960) (owner: 10Marostegui) [15:18:45] (03PS18) 10Dzahn: icinga/planet: add generic check_lastmod plugin and check planet updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) [15:18:47] (03PS1) 10Dzahn: redis: add Icinga notes URL [puppet] - 10https://gerrit.wikimedia.org/r/497508 [15:18:51] (03PS1) 10Dzahn: pybal: add Icinga notes URL [puppet] - 10https://gerrit.wikimedia.org/r/497509 [15:18:53] (03PS1) 10Dzahn: dumps: add Icinga notes URL [puppet] - 10https://gerrit.wikimedia.org/r/497510 [15:18:55] (03PS1) 10Dzahn: jenkins: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497511 [15:18:57] (03PS1) 10Dzahn: varnish/trafficserver: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497512 [15:19:56] (03Merged) 10jenkins-bot: db-eqiad.php: Set s2 on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497472 (https://phabricator.wikimedia.org/T187960) (owner: 10Marostegui) [15:20:04] (03PS1) 10Marostegui: Revert "db-eqiad.php: Set s2 on read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497513 
[15:21:10] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Set s2 database master on read only - T187960 (duration: 00m 48s) [15:21:12] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [15:21:13] T187960: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 [15:22:37] stopping mysql on db1066 [15:22:45] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Set s2 on read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497513 (owner: 10Marostegui) [15:22:47] (03PS2) 10Vgutierrez: acme-chief: Use relative symlinks for the live one [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497504 (https://phabricator.wikimedia.org/T218685) [15:22:48] I see it down already/cannot connect [15:23:25] still stopping [15:23:36] yeah, it takes time :-) [15:23:42] don't worry [15:23:51] Not worried, just giving the update :) [15:23:58] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Set s2 on read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497513 (owner: 10Marostegui) [15:23:59] I dumped the buffer pool earlier and disabled it on stop anyways [15:24:05] did you get the coords? [15:24:07] I merged that to be ready [15:24:07] yep [15:24:10] they are on our etherpad [15:24:11] cool [15:24:50] starting [15:25:12] I see it up [15:26:10] Are all wikis back up now? [15:26:17] (03PS6) 10Jbond: Create and mtail parser for ulogd and install it on the syslog server [puppet] - 10https://gerrit.wikimedia.org/r/496776 (https://phabricator.wikimedia.org/T215277) [15:26:29] RhinosF1: read only still on s2 [15:27:07] Thanks. [15:27:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Set s2 read only OFF - T187960 (duration: 00m 26s) [15:27:57] Is there an ETC? [15:27:58] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[15:27:59] T187960: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 [15:27:59] (03CR) 10Alex Monk: [C: 03+2] acme-chief: Use relative symlinks for the live one [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497504 (https://phabricator.wikimedia.org/T218685) (owner: 10Vgutierrez) [15:28:03] s2 is back writable [15:28:10] fatal monitor just exploded with a lot of `LoadBalancer.php: Cannot access the database: Unknown error` [15:28:15] !log mobrovac@deploy1001 Started deploy [restbase/deploy@62df8c3]: Update the docs page title; deploy v0.19.3 [15:28:15] mobrovac@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [15:28:17] marostegui: is that you? [15:28:21] zeljkof: probably because of the read only I guess [15:28:32] seems to be going down [15:28:46] (03CR) 10Alex Monk: [C: 04-2] "wait" [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497504 (https://phabricator.wikimedia.org/T218685) (owner: 10Vgutierrez) [15:28:49] jynus: I see writes already, but let's check [15:29:16] I am not talking because I am checking [15:29:22] ta [15:29:23] and seen nothing bad so far [15:29:47] I did some test edits, of course [15:29:50] I can edit too [15:29:53] (03PS3) 10Vgutierrez: acme-chief: Use relative symlinks for the live one [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497504 (https://phabricator.wikimedia.org/T218685) [15:30:01] and so did RhinosF1 [15:30:05] :) [15:30:13] Enwikitionary Is up [15:30:22] RhinosF1: :) [15:30:53] Np [15:31:34] (03CR) 10Jbond: Create and mtail parser for ulogd and install it on the syslog server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/496776 (https://phabricator.wikimedia.org/T215277) (owner: 10Jbond) [15:31:58] (03CR) 10Alex Monk: [C: 03+2] "ok" [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497504 (https://phabricator.wikimedia.org/T218685) (owner: 10Vgutierrez) [15:32:06] (03CR) 10Nikerabbit: Cron to run script to 
purge old CX drafts (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [15:32:28] (03CR) 10Alex Monk: [C: 03+2] acme-chief: Ensure that the CN is part of the SNI list for certs config [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497439 (https://phabricator.wikimedia.org/T218418) (owner: 10Alex Monk) [15:32:30] (03CR) 10Alex Monk: [C: 03+2] acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497440 (https://phabricator.wikimedia.org/T207295) (owner: 10Alex Monk) [15:32:37] (03PS1) 10Alex Monk: acme-chief: Use relative symlinks for the live one [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497521 (https://phabricator.wikimedia.org/T218685) [15:33:31] (03Merged) 10jenkins-bot: acme-chief: Use relative symlinks for the live one [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497504 (https://phabricator.wikimedia.org/T218685) (owner: 10Vgutierrez) [15:33:32] zeljkof: btw, we are done [15:33:33] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [15:33:38] marostegui: all done? [15:33:43] addshore: not x1 yet [15:33:45] that was s2 [15:33:47] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10Marostegui) For the record, read only times on s2: Read only ON: 15:21:10 Read only OFF: 15:27:56 [15:33:48] okay! 
[15:33:50] marostegui: great, I'll cut the branch then [15:33:51] (03PS4) 10Alex Monk: Release 0.13 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497444 (https://phabricator.wikimedia.org/T218685) [15:33:51] I will ping you for s2 [15:33:56] for x1, I mean XD [15:33:58] I have another meeting now, but if you shout loud enough me or Amir1 will see it [15:34:07] wildo [15:34:09] (03Merged) 10jenkins-bot: acme-chief: Ensure that the CN is part of the SNI list for certs config [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497439 (https://phabricator.wikimedia.org/T218418) (owner: 10Alex Monk) [15:35:59] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [15:36:55] marostegui: was this db related --^ ? [15:36:56] (03CR) 10Vgutierrez: [C: 03+2] acme-chief: Use relative symlinks for the live one [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497521 (https://phabricator.wikimedia.org/T218685) (owner: 10Alex Monk) [15:37:04] elukey: I guess so yep [15:37:06] (didn't check logs yet) [15:37:19] elukey: we had a spike of errors during the read only [15:37:26] ack ack [15:37:32] just wanted to keep things under control [15:37:51] (03CR) 10Alex Monk: [C: 03+2] Release 0.13 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497444 (https://phabricator.wikimedia.org/T218685) (owner: 10Alex Monk) [15:38:11] (03CR) 10Vgutierrez: [C: 03+2] Release 0.13 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497444 (https://phabricator.wikimedia.org/T218685) (owner: 10Alex Monk) [15:38:41] https://logstash.wikimedia.org/goto/ac2b98ff1bf2925eb36e6c82b146fff5 [15:38:46] (03PS3) 10Alex Monk: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497440 
(https://phabricator.wikimedia.org/T207295) [15:38:54] (03CR) 10Alex Monk: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497440 (https://phabricator.wikimedia.org/T207295) (owner: 10Alex Monk) [15:38:57] (03CR) 10Alex Monk: [C: 03+2] acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497440 (https://phabricator.wikimedia.org/T207295) (owner: 10Alex Monk) [15:39:07] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [15:39:13] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Icinga, 10Wikimedia-Logstash: Remove elasticsearch icinga checks from logstash data nodes - https://phabricator.wikimedia.org/T218691 (10Mathew.onipe) [15:39:27] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Icinga, 10Wikimedia-Logstash: Remove elasticsearch icinga checks from logstash data nodes - https://phabricator.wikimedia.org/T218691 (10Mathew.onipe) p:05Triage→03High [15:40:42] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@62df8c3]: Update the docs page title; deploy v0.19.3 (duration: 12m 27s) [15:40:43] mobrovac@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [15:40:58] !log mobrovac@deploy1001 Started deploy [restbase/deploy@62df8c3]: Update the docs page title; deploy v0.19.3, take #2 [15:40:59] mobrovac@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[15:41:51] (03PS26) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [15:41:53] (03Merged) 10jenkins-bot: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497440 (https://phabricator.wikimedia.org/T207295) (owner: 10Alex Monk) [15:42:19] (03CR) 10Mathew.onipe: elasticsearch: add profile for icinga checks (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [15:43:40] (03PS5) 10Vgutierrez: Release 0.13 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497444 (https://phabricator.wikimedia.org/T218685) (owner: 10Alex Monk) [15:45:56] !log disable pybal on lvs1004 [15:45:57] XioNoX: Failed to log message to wiki. Somebody should check the error logs. [15:46:15] (03CR) 10Mathew.onipe: "PCC is ok and expected: https://puppet-compiler.wmflabs.org/compiler1002/15212/" [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [15:48:09] PROBLEM - PyBal backends health check on lvs1004 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [15:48:21] PROBLEM - pybal on lvs1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [15:49:14] (03CR) 10jenkins-bot: Release 0.13 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497444 (https://phabricator.wikimedia.org/T218685) (owner: 10Alex Monk) [15:49:23] RECOVERY - pybal on lvs1004 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [15:49:35] !log enable pybal on lvs1004 [15:49:35] godog, herron : when you have some time, https://phabricator.wikimedia.org/T218691 [15:49:35] XioNoX: Failed to log message to wiki. Somebody should check the error logs. 
[15:49:57] (03PS2) 10Alex Monk: acme-chief: Use relative symlinks for the live one [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497521 (https://phabricator.wikimedia.org/T218685) [15:49:58] (03CR) 10Alex Monk: [C: 03+2] acme-chief: Use relative symlinks for the live one [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497521 (https://phabricator.wikimedia.org/T218685) (owner: 10Alex Monk) [15:50:03] there is neon with a soft, but not sure what that host does any longer [15:50:09] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [15:50:10] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [15:50:11] RECOVERY - PyBal backends health check on lvs1004 is OK: PYBAL OK - All pools are healthy [15:50:12] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [15:50:12] !log otto@deploy1001 scap-helm eventgate-analytics finished [15:50:12] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [15:50:12] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[15:50:22] (03PS1) 10Alex Monk: Release 0.13 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497527 (https://phabricator.wikimedia.org/T218685) [15:50:35] PROBLEM - Hadoop NodeManager on analytics1059 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:50:51] PROBLEM - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:50:51] PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:50:54] this is me --^ [15:50:55] sigh [15:51:01] PROBLEM - Hadoop NodeManager on analytics1064 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:01] PROBLEM - Hadoop NodeManager on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:04] ah, I thought it was network [15:51:05] PROBLEM - Hadoop NodeManager on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:07] PROBLEM - Hadoop NodeManager on an-worker1088 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:07] PROBLEM - Hadoop NodeManager on analytics1046 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:08] PROBLEM - Hadoop NodeManager on analytics1062 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:08] then, cool [15:51:09] PROBLEM - Hadoop NodeManager on an-worker1084 is CRITICAL: PROCS CRITICAL: 0 processes with
command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:11] PROBLEM - Hadoop NodeManager on analytics1051 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:11] PROBLEM - Hadoop NodeManager on analytics1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:11] PROBLEM - Hadoop NodeManager on analytics1066 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:11] PROBLEM - Hadoop NodeManager on analytics1075 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:13] PROBLEM - Hadoop NodeManager on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:15] PROBLEM - Hadoop NodeManager on analytics1053 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:17] PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:19] PROBLEM - Hadoop NodeManager on analytics1063 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:19] PROBLEM - Hadoop NodeManager on analytics1052 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:21] PROBLEM - Hadoop NodeManager on analytics1045 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:23] PROBLEM - Hadoop NodeManager on analytics1056 is CRITICAL: PROCS CRITICAL: 0 processes with command 
name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:23] PROBLEM - Hadoop NodeManager on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:25] PROBLEM - Hadoop NodeManager on analytics1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:25] PROBLEM - Hadoop NodeManager on analytics1055 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:27] PROBLEM - Hadoop NodeManager on analytics1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:28] (03Merged) 10jenkins-bot: acme-chief: Use relative symlinks for the live one [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497521 (https://phabricator.wikimedia.org/T218685) (owner: 10Alex Monk) [15:51:29] PROBLEM - Hadoop NodeManager on an-worker1079 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:35] PROBLEM - Hadoop NodeManager on analytics1043 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:45] PROBLEM - Hadoop NodeManager on analytics1049 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:49] (03CR) 10Alex Monk: [C: 03+2] Release 0.13 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497527 (https://phabricator.wikimedia.org/T218685) (owner: 10Alex Monk) [15:51:57] PROBLEM - Hadoop NodeManager on an-worker1082 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:57] PROBLEM - Hadoop NodeManager on analytics1061 is 
CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:52:01] PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:52:07] PROBLEM - Hadoop NodeManager on an-worker1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:52:21] (03PS2) 10Alex Monk: Release 0.13 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497527 (https://phabricator.wikimedia.org/T218685) [15:52:24] !log disable pybal on lvs1005 [15:52:25] (03CR) 10Alex Monk: Release 0.13 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497527 (https://phabricator.wikimedia.org/T218685) (owner: 10Alex Monk) [15:52:25] XioNoX: Failed to log message to wiki. Somebody should check the error logs. [15:52:28] (03CR) 10Alex Monk: [C: 03+2] Release 0.13 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497527 (https://phabricator.wikimedia.org/T218685) (owner: 10Alex Monk) [15:52:45] PROBLEM - Hadoop NodeManager on an-worker1095 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:54:03] PROBLEM - YARN NodeManager Node-State on an-worker1080 is CRITICAL: CRITICAL: YARN NodeManager an-worker1080.eqiad.wmnet:8041 Node-State: 19/03/19 15:54:01 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to an-master1002-eqiad-wmnet [15:54:03] PROBLEM - YARN NodeManager Node-State on an-worker1089 is CRITICAL: CRITICAL: YARN NodeManager an-worker1089.eqiad.wmnet:8041 Node-State: 19/03/19 15:54:01 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to an-master1002-eqiad-wmnet [15:54:03] PROBLEM - YARN NodeManager Node-State on analytics1071 is CRITICAL: CRITICAL: YARN NodeManager analytics1071.eqiad.wmnet:8041 Node-State: 19/03/19 
15:54:02 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to an-master1002-eqiad-wmnet [15:54:03] PROBLEM - YARN NodeManager Node-State on analytics1075 is CRITICAL: CRITICAL: YARN NodeManager analytics1075.eqiad.wmnet:8041 Node-State: 19/03/19 15:54:02 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to an-master1002-eqiad-wmnet [15:54:04] PROBLEM - YARN NodeManager Node-State on analytics1064 is CRITICAL: CRITICAL: YARN NodeManager analytics1064.eqiad.wmnet:8041 Node-State: 19/03/19 15:54:02 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to an-master1002-eqiad-wmnet [15:54:21] (03Merged) 10jenkins-bot: Release 0.13 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497527 (https://phabricator.wikimedia.org/T218685) (owner: 10Alex Monk) [15:54:50] !log enable pybal on lvs1005 [15:54:51] XioNoX: Failed to log message to wiki. Somebody should check the error logs. [15:55:46] !log disable pybal on lvs1006 [15:55:46] XioNoX: Failed to log message to wiki. Somebody should check the error logs. [15:55:56] PROBLEM - YARN NodeManager Node-State on analytics1050 is CRITICAL: CRITICAL: YARN NodeManager analytics1050.eqiad.wmnet:8041 Node-State: 19/03/19 15:55:55 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to an-master1002-eqiad-wmnet [15:56:22] PROBLEM - Hadoop NodeManager on analytics1057 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:56:39] should recover in a bit [15:57:23] !log enable pybal on lvs1006 [15:57:23] XioNoX: Failed to log message to wiki. Somebody should check the error logs. 
[15:57:46] RECOVERY - YARN NodeManager Node-State on an-worker1079 is OK: OK: YARN NodeManager an-worker1079.eqiad.wmnet:8041 Node-State: RUNNING [15:57:50] RECOVERY - Hadoop NodeManager on analytics1045 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:57:52] RECOVERY - Hadoop NodeManager on analytics1056 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:57:52] (03PS1) 10Alex Monk: debian: Add release 0.13 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497532 (https://phabricator.wikimedia.org/T218685) [15:57:54] RECOVERY - Hadoop NodeManager on an-worker1078 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:57:56] RECOVERY - Hadoop NodeManager on analytics1055 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:58:00] RECOVERY - Hadoop NodeManager on an-worker1079 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:58:06] RECOVERY - Hadoop NodeManager on analytics1043 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:58:18] RECOVERY - YARN NodeManager Node-State on analytics1052 is OK: OK: YARN NodeManager analytics1052.eqiad.wmnet:8041 Node-State: RUNNING [15:58:28] RECOVERY - Hadoop NodeManager on an-worker1082 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:58:28] RECOVERY - Hadoop NodeManager on an-worker1085 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:58:30] RECOVERY - Hadoop NodeManager on analytics1061 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
[15:58:32] RECOVERY - YARN NodeManager Node-State on analytics1077 is OK: OK: YARN NodeManager analytics1077.eqiad.wmnet:8041 Node-State: RUNNING [15:58:32] RECOVERY - YARN NodeManager Node-State on analytics1071 is OK: OK: YARN NodeManager analytics1071.eqiad.wmnet:8041 Node-State: RUNNING [15:58:32] RECOVERY - YARN NodeManager Node-State on an-worker1080 is OK: OK: YARN NodeManager an-worker1080.eqiad.wmnet:8041 Node-State: RUNNING [15:58:33] RECOVERY - YARN NodeManager Node-State on an-worker1081 is OK: OK: YARN NodeManager an-worker1081.eqiad.wmnet:8041 Node-State: RUNNING [15:58:33] RECOVERY - YARN NodeManager Node-State on an-worker1089 is OK: OK: YARN NodeManager an-worker1089.eqiad.wmnet:8041 Node-State: RUNNING [15:58:42] !log changing password for User:St3f [15:58:43] tzatziki: Failed to log message to wiki. Somebody should check the error logs. [15:58:47] oops [15:59:35] tzatziki: it is ok, logs are working, just some issue with sal on wikitech [15:59:50] RECOVERY - Hadoop NodeManager on an-worker1092 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:59:50] RECOVERY - YARN NodeManager Node-State on analytics1070 is OK: OK: YARN NodeManager analytics1070.eqiad.wmnet:8041 Node-State: RUNNING [15:59:51] RECOVERY - YARN NodeManager Node-State on an-worker1091 is OK: OK: YARN NodeManager an-worker1091.eqiad.wmnet:8041 Node-State: RUNNING [15:59:52] RECOVERY - YARN NodeManager Node-State on an-worker1086 is OK: OK: YARN NodeManager an-worker1086.eqiad.wmnet:8041 Node-State: RUNNING [15:59:53] RECOVERY - YARN NodeManager Node-State on analytics1051 is OK: OK: YARN NodeManager analytics1051.eqiad.wmnet:8041 Node-State: RUNNING [15:59:54] RECOVERY - YARN NodeManager Node-State on analytics1064 is OK: OK: YARN NodeManager analytics1064.eqiad.wmnet:8041 Node-State: RUNNING [15:59:55] RECOVERY - YARN NodeManager Node-State on analytics1075 is OK: OK: YARN NodeManager analytics1075.eqiad.wmnet:8041 
Node-State: RUNNING [15:59:56] RECOVERY - YARN NodeManager Node-State on an-worker1084 is OK: OK: YARN NodeManager an-worker1084.eqiad.wmnet:8041 Node-State: RUNNING [15:59:57] RECOVERY - YARN NodeManager Node-State on analytics1049 is OK: OK: YARN NodeManager analytics1049.eqiad.wmnet:8041 Node-State: RUNNING [15:59:58] RECOVERY - YARN NodeManager Node-State on an-worker1092 is OK: OK: YARN NodeManager an-worker1092.eqiad.wmnet:8041 Node-State: RUNNING [16:00:04] godog and _joe_: #bothumor My software never has bugs. It just develops random features. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190319T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:13] tzatziki: https://tools.wmflabs.org/sal [16:00:33] jynus: ah no worries. [16:01:59] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@62df8c3]: Update the docs page title; deploy v0.19.3, take #2 (duration: 21m 01s) [16:01:59] mobrovac@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [16:02:30] !log ayounsi@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns1001.wikimedia.org,service=pdns_recursor [16:02:30] ayounsi@puppetmaster1001: Failed to log message to wiki. Somebody should check the error logs. [16:03:48] (03CR) 10Vgutierrez: [C: 03+2] debian: Add release 0.13 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497532 (https://phabricator.wikimedia.org/T218685) (owner: 10Alex Monk) [16:05:43] (03CR) 10GTirloni: "Very good point. Setting it in hiera would disable it for all projcts. 
I can't confirm in Horizon right now but I believe projects that wo" [puppet] - 10https://gerrit.wikimedia.org/r/495670 (https://phabricator.wikimedia.org/T218009) (owner: 10GTirloni) [16:05:56] (03Merged) 10jenkins-bot: debian: Add release 0.13 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497532 (https://phabricator.wikimedia.org/T218685) (owner: 10Alex Monk) [16:06:48] addshore Amir1 we are going to put x1 master on read only [16:06:58] okay! [16:07:02] * addshore watches [16:07:48] addshore: read only on [16:08:20] !log ayounsi@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns1001.wikimedia.org,service=pdns_recursor [16:08:21] ayounsi@puppetmaster1001: Failed to log message to wiki. Somebody should check the error logs. [16:09:40] addshore: done - read only off [16:09:48] marostegui: awesome [16:10:24] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [16:10:24] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [16:10:26] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [16:10:26] !log otto@deploy1001 scap-helm eventgate-analytics finished [16:10:26] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [16:10:27] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [16:11:34] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Icinga, 10Wikimedia-Logstash: Remove elasticsearch icinga checks from logstash data nodes - https://phabricator.wikimedia.org/T218691 (10herron) Why do you say the elasticsearch icinga checks are not needed on the logstash elasticsearch data/master nod... [16:11:57] marostegui: only 360 or so [16:12:26] yeah, not too bad [16:12:47] what did addshore want to check? 
[16:12:54] FYI: I'm still trying to cut the branch [16:13:11] some technical problems, thanks thcipriani for helping me [16:13:18] (03PS1) 10Vgutierrez: Get rid of the no-break space that got in the Requirements section title [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497538 [16:13:26] marostegui: nothing for me to check, I just need to run a script to add any data that might be missing to the cognate db now :) [16:13:29] (03CR) 10CRusnov: [C: 03+2] Add LimitCORE support for uwsgi units. [puppet] - 10https://gerrit.wikimedia.org/r/493294 (owner: 10CRusnov) [16:13:30] jynus: ^^ [16:13:40] (03PS4) 10CRusnov: Add LimitCORE support for uwsgi units. [puppet] - 10https://gerrit.wikimedia.org/r/493294 [16:15:16] (03CR) 10Vgutierrez: [C: 03+2] Get rid of the no-break space that got in the Requirements section title [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497538 (owner: 10Vgutierrez) [16:15:18] !log downtimed labstore1003 for network moves so it doesn't page [16:15:18] bstorm_: Failed to log message to wiki. Somebody should check the error logs. [16:15:22] blah [16:17:01] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:17:29] (03Merged) 10jenkins-bot: Get rid of the no-break space that got in the Requirements section title [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497538 (owner: 10Vgutierrez) [16:18:01] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:18:59] (03CR) 10Arturo Borrero Gonzalez: "Could you please elaborate a bit more on what is this doing, and why is this code here and why this can't be added to some profile, someth" [puppet] - 10https://gerrit.wikimedia.org/r/497430 (owner: 10Alex Monk) [16:20:15] (03CR) 10Arturo Borrero Gonzalez: "> Could you please elaborate a bit more on what is this doing, and" [puppet] - 10https://gerrit.wikimedia.org/r/497430 (owner: 10Alex Monk) [16:20:44] (03CR) 10Andrew Bogott: "That sounds right -- probably we want the default to be 'true' in the puppet code, and then individual projects can opt out by setting 'fa" [puppet] - 10https://gerrit.wikimedia.org/r/495670 (https://phabricator.wikimedia.org/T218009) (owner: 10GTirloni) [16:22:06] (03CR) 10Alex Monk: "It's not necessarily specific to deployment-prep, you'd want this to set acme-chief up anywhere in labs." [puppet] - 10https://gerrit.wikimedia.org/r/497430 (owner: 10Alex Monk) [16:22:10] (03PS1) 10Vgutierrez: Release 0.14 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497541 [16:23:16] (03CR) 10Alex Monk: "There's not much more to say about what this is doing than what's already in the commit message, it could be moved to a profile but that's" [puppet] - 10https://gerrit.wikimedia.org/r/497430 (owner: 10Alex Monk) [16:24:07] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:24:37] (03CR) 10Alex Monk: [C: 03+2] Release 0.14 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/497541 (owner: 10Vgutierrez) [16:25:57] (03CR) 10Vgutierrez: acme_chief: Add security::access::config on passive host if realm == labs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/497430 (owner: 10Alex Monk) [16:26:04] (03Merged) 10jenkins-bot: Release 0.14 [software/acme-chief] - 
10https://gerrit.wikimedia.org/r/497541 (owner: 10Vgutierrez) [16:27:17] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:28:00] (03PS1) 10Vgutierrez: Get rid of the no-break space that got in the Requirements section title [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497543 [16:28:02] (03PS1) 10Vgutierrez: Release 0.14 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497544 [16:28:29] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Consistently use HTML attributes with quotes [puppet] - 10https://gerrit.wikimedia.org/r/497346 (owner: 10Fomafix) [16:28:45] that's ongoing troubleshooting by equinix ^ [16:29:53] PROBLEM - Host kafka1012 is DOWN: PING CRITICAL - Packet loss = 100% [16:29:58] !log stop eventlogging's mysql kafka consumers on eventlog1002, eventlogging's db replication on db1108 to ease db1107's maintenance [16:29:58] elukey: Failed to log message to wiki. Somebody should check the error logs. [16:30:01] (03CR) 10Vgutierrez: [C: 03+2] Get rid of the no-break space that got in the Requirements section title [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497543 (owner: 10Vgutierrez) [16:30:07] !log stop eventlogging's mysql kafka consumers on eventlog1002, eventlogging's db replication on db1108 to ease db1107's maintenance [16:30:07] elukey: Failed to log message to wiki. Somebody should check the error logs. 
[16:30:10] what [16:30:22] (03CR) 10Vgutierrez: [C: 03+2] Release 0.14 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497544 (owner: 10Vgutierrez) [16:30:23] PROBLEM - Host kafka1013 is DOWN: PING CRITICAL - Packet loss = 100% [16:30:35] elukey: the !log is borked atm [16:30:43] elukey: don't worry about that :) [16:31:49] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "I find *.iml mentioned in MediaWiki core's .gitignore. It also does not have a slash there." [puppet] - 10https://gerrit.wikimedia.org/r/497306 (owner: 10Ladsgroup) [16:32:00] (03CR) 10jenkins-bot: Get rid of the no-break space that got in the Requirements section title [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497543 (owner: 10Vgutierrez) [16:32:13] (03Merged) 10jenkins-bot: Release 0.14 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497544 (owner: 10Vgutierrez) [16:32:49] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [16:33:37] I'm hoping that when it comes back up someone will process all failed logs [16:35:30] (03PS1) 10Vgutierrez: debian: Add release 0.14 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497546 [16:36:03] (03PS10) 10KartikMistry: Cron to run script to purge old CX drafts [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) [16:36:08] (03CR) 10KartikMistry: Cron to run script to purge old CX drafts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [16:36:27] (03CR) 10Cwhite: [C: 03+1] "Looks 
good to me." [puppet] - 10https://gerrit.wikimedia.org/r/497421 (https://phabricator.wikimedia.org/T212526) (owner: 10CRusnov) [16:37:39] (03CR) 10Alex Monk: [C: 03+2] debian: Add release 0.14 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497546 (owner: 10Vgutierrez) [16:38:35] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Icinga, 10Wikimedia-Logstash: Remove elasticsearch icinga checks from logstash collectors - https://phabricator.wikimedia.org/T218691 (10Mathew.onipe) [16:39:19] RECOVERY - Host kafka1012 is UP: PING OK - Packet loss = 0%, RTA = 36.25 ms [16:39:21] RECOVERY - Host kafka1013 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [16:39:25] (03PS7) 10CRusnov: Add configuration file for netbox reports [puppet] - 10https://gerrit.wikimedia.org/r/497421 (https://phabricator.wikimedia.org/T212526) [16:39:41] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Icinga, 10Wikimedia-Logstash: Remove elasticsearch icinga checks from logstash collectors - https://phabricator.wikimedia.org/T218691 (10Mathew.onipe) @herron Sorry, I meant logstash collectors: logstash100[7-9] and logstash200[4-6] [16:41:15] Krenair: BTW, regarding the FW rules for acme-chief servers in labs.. maybe is time to give some love to the puppetization and put that in the labs profile for acme-chief [16:41:29] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:45:15] !log uploaded acme-chief 0.14 to apt.wikimedia.org (buster) - T218685 T218418 T207295 [16:45:20] vgutierrez: Failed to log message to wiki. Somebody should check the error logs. 
[16:45:21] T218418: CN + SNI list on config file doesn't match issued certificate on some scenarios - https://phabricator.wikimedia.org/T218418 [16:45:21] T207295: Expose not-yet-live certs to clients so they can handle OCSP stapling - https://phabricator.wikimedia.org/T207295 [16:45:22] T218685: acme-chief creates absolute symlinks for "live" and relative for "new" on certificate renewal - https://phabricator.wikimedia.org/T218685 [16:45:46] PROBLEM - puppet last run on kafka1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:45:52] (03CR) 10jenkins-bot: debian: Add release 0.14 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/497546 (owner: 10Vgutierrez) [16:46:41] (03CR) 10CRusnov: [C: 03+2] Add configuration file for netbox reports [puppet] - 10https://gerrit.wikimedia.org/r/497421 (https://phabricator.wikimedia.org/T212526) (owner: 10CRusnov) [16:47:08] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 230, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:48:36] (03PS2) 10Jcrespo: mariadb-snapshots: Run transfer as root, not dump [puppet] - 10https://gerrit.wikimedia.org/r/497491 (https://phabricator.wikimedia.org/T206203) [16:48:41] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Icinga, 10Wikimedia-Logstash: Remove elasticsearch icinga checks from logstash collectors - https://phabricator.wikimedia.org/T218691 (10herron) Ah, thanks for clarifying! I agree we probably don't need the full suite of checks on the client nodes, bu... 
[16:50:57] (03PS1) 10Jforrester: .gitmodules: Update gerrit URI syntax, deprecated ages ago, now unsupported [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497554 (https://phabricator.wikimedia.org/T218694) [16:51:06] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The fact that namespaced objects are found in common.yaml for labs is a bug, and we shouldn't use it." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/497481 (owner: 10Jbond) [16:53:55] (03PS1) 10CDanis: eqiad-prod: 0 weight to ms-be1043/sdk1 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/497557 (https://phabricator.wikimedia.org/T218544) [17:00:05] cscott, arlolra, subbu, halfak, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190319T1700). [17:05:33] (03PS1) 10Alexandros Kosiaris: gerrit: enable httpd request log [puppet] - 10https://gerrit.wikimedia.org/r/497561 [17:06:02] RECOVERY - puppet last run on kafka1013 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:06:07] (03CR) 10Jcrespo: [C: 03+2] "This is a bug, followup to transfer.pp monday deployment." 
[puppet] - 10https://gerrit.wikimedia.org/r/497491 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [17:06:26] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [17:09:08] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational [17:09:51] (03CR) 10Paladox: [C: 03+1] gerrit: enable httpd request log [puppet] - 10https://gerrit.wikimedia.org/r/497561 (owner: 10Alexandros Kosiaris) [17:10:15] (03PS3) 10Volans: acme_chief: Remove stretch support [puppet] - 10https://gerrit.wikimedia.org/r/497431 (owner: 10Alex Monk) [17:10:23] (03CR) 10Paladox: [C: 03+1] "You may want to change https://github.com/wikimedia/puppet/blob/production/modules/gerrit/templates/log4j.xml.erb#L29 depending on the thr" [puppet] - 10https://gerrit.wikimedia.org/r/497561 (owner: 10Alexandros Kosiaris) [17:10:28] Krenair: I'm just testing a thing in gerrit (FYI) ^^^ [17:10:30] (03PS1) 10CRusnov: Add configuration file for netbox reports [puppet] - 10https://gerrit.wikimedia.org/r/497563 (https://phabricator.wikimedia.org/T212526) [17:12:01] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10aborrero) [17:14:24] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@64f09a0]: T150377: Bump wikimedia-page-library to 6.3.0 [17:14:26] mholloway-shell@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[17:14:27] T150377: Use 'outside' instead of 'Inside' for list-style in Mobile version so that long line bullet list items are more scannable - https://phabricator.wikimedia.org/T150377 [17:16:54] !log started "foreachwikiindblist wiktionary extensions/Cognate/maintenance/populateCognatePages.php --batch-size 1000" in a screen on mwdebug1002 (catching up cognate after x1 readonly time) [17:16:54] addshore: Failed to log message to wiki. Somebody should check the error logs. [17:16:57] bah [17:17:21] (03CR) 10Cwhite: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/497563 (https://phabricator.wikimedia.org/T212526) (owner: 10CRusnov) [17:18:11] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@64f09a0]: T150377: Bump wikimedia-page-library to 6.3.0 (duration: 03m 47s) [17:18:15] mholloway-shell@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [17:19:05] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [17:20:14] 10Operations, 10Discovery, 10Discovery-Search (Current work): Create extra elasticsearch clusters in beta cluster - https://phabricator.wikimedia.org/T213940 (10Mathew.onipe) [17:20:19] (03PS2) 10CRusnov: Add configuration file for netbox reports [puppet] - 10https://gerrit.wikimedia.org/r/497563 (https://phabricator.wikimedia.org/T212526) [17:20:23] cscott: i just assigned https://phabricator.wikimedia.org/T218702 to you =] ;) [17:20:45] jouncebot: now [17:20:45] For the next 0 hour(s) and 39 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190319T1700) [17:21:25] 10Operations, 10Discovery, 10Discovery-Search (Current work): Create extra elasticsearch clusters in 
beta cluster - https://phabricator.wikimedia.org/T213940 (10Mathew.onipe) a:03Mathew.onipe [17:21:54] (03CR) 10CRusnov: [C: 03+2] Add configuration file for netbox reports [puppet] - 10https://gerrit.wikimedia.org/r/497563 (https://phabricator.wikimedia.org/T212526) (owner: 10CRusnov) [17:22:09] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [17:22:17] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:23:32] is it okay if I run a maintenance script, or are we still in incident winddown mode where I shouldn’t disturb anything? [17:23:37] PROBLEM - puppet last run on analytics1052 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 9 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_conntrack] [17:24:36] !log changing email for User:St3f [17:24:37] tzatziki: Failed to log message to wiki. Somebody should check the error logs. [17:24:42] yeah yeah [17:26:19] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [17:26:56] (I guess if the train isn’t done yet I don’t want to disturb that either) [17:27:56] mobileapps deploy failed for group default3, retrying [17:28:06] (will log manually later) [17:29:26] marostegui: you can use !log but it doesn't end up on wikitech. 
Does end up in other SAL stuff [17:29:57] wrong m* person [17:30:10] ^ mdholloway [17:30:26] Reedy: jynus: ah, thanks [17:30:37] !log mobileapps deploy failed for group default3, retrying [17:30:37] mdholloway: Failed to log message to wiki. Somebody should check the error logs. [17:31:07] this is "the other stuff" https://tools.wmflabs.org/sal [17:31:22] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@64f09a0]: T150377: Bump wikimedia-page-library to 6.3.0 [17:31:24] mholloway-shell@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [17:31:24] T150377: Use 'outside' instead of 'Inside' for list-style in Mobile version so that long line bullet list items are more scannable - https://phabricator.wikimedia.org/T150377 [17:31:35] also https://twitter.com/wikimediatech still works [17:32:04] (03PS6) 10CRusnov: Port MakeVM to cookbook. [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) [17:33:12] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@64f09a0]: T150377: Bump wikimedia-page-library to 6.3.0 (duration: 01m 50s) [17:33:14] mholloway-shell@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [17:33:47] !log contint1001 / CI going for a quick scheduled maintenance -network cable being moved- [17:33:47] hasharAway: Failed to log message to wiki. Somebody should check the error logs. [17:33:50] ... [17:34:35] (03CR) 10jerkins-bot: [V: 04-1] Port MakeVM to cookbook. [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) (owner: 10CRusnov) [17:37:14] (03PS7) 10CRusnov: Port MakeVM to cookbook. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) [17:37:27] (03PS1) 10Herron: logstash: move mediawiki syslogs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/497570 (https://phabricator.wikimedia.org/T213899) [17:38:26] 10Operations, 10monitoring, 10Patch-For-Review, 10User-CDanis: graph server temperature metrics - https://phabricator.wikimedia.org/T209863 (10CDanis) 05Open→03Resolved https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=mw1222&var-datasource=eqiad%20prometheus%2Fops&v... [17:40:59] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/15213/mw1221.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/497570 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [17:42:32] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "I just checked with giovanni:" [puppet] - 10https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [17:42:41] (03CR) 10Krinkle: [C: 03+2] .gitmodules: Update gerrit URI syntax, deprecated ages ago, now unsupported [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497554 (https://phabricator.wikimedia.org/T218694) (owner: 10Jforrester) [17:42:54] (03CR) 10CRusnov: "The report configuration is deployed, so this should be good to go." [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/495267 (https://phabricator.wikimedia.org/T212526) (owner: 10CRusnov) [17:43:25] I'm pulling down that wmf-config change to deploy1001 to avoid surprises, but won't deploy it. [17:43:35] (nothing to deploy) [17:43:52] (03Merged) 10jenkins-bot: .gitmodules: Update gerrit URI syntax, deprecated ages ago, now unsupported [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497554 (https://phabricator.wikimedia.org/T218694) (owner: 10Jforrester) [17:45:25] 10Operations, 10Release-Engineering-Team, 10Stashbot: Stashbot: Failed to log message to wiki. 
SAL can't write to wikitech.wikimedia.org - https://phabricator.wikimedia.org/T218708 (10hashar) [17:45:43] Hi, I accidentally hit the wrong shortcuts and deleted Steward requests/Global permissions. I have problems to restore it. Is anyone here to give some advice? [17:47:12] wat [17:47:18] Link to page? [17:47:36] 10Operations, 10Release-Engineering-Team, 10Stashbot, 10Toolforge: Stashbot: Failed to log message to wiki. SAL can't write to wikitech.wikimedia.org - https://phabricator.wikimedia.org/T218708 (10hashar) I have looked at the bots doc on: https://wikitech.wikimedia.org/wiki/Tool:Stashbot https://wikitech... [17:47:37] https://meta.wikimedia.org/w/index.php?title=Steward_requests/Global_permissions&action=edit&redlink=1 [17:47:41] You're right! [17:47:45] Oh my gosh [17:47:47] !log Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds (T216270) [17:47:50] Lucas_WMDE: Failed to log message to wiki. Somebody should check the error logs. [17:47:51] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [17:47:52] T216270: Update property suggester data - https://phabricator.wikimedia.org/T216270 [17:48:23] lol, 20k edits [17:48:46] lol [17:48:50] 10Operations, 10Release-Engineering-Team, 10Stashbot, 10Toolforge: Stashbot: Failed to log message to wiki. SAL can't write to wikitech.wikimedia.org - https://phabricator.wikimedia.org/T218708 (10bd808) [17:48:54] atleast it wasnt main page [17:49:03] so shall we try to restore it by batches of idk how many edits or you can just restore it from shell? :) [17:49:13] RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:49:17] I think they are able to do it pretty fast [17:49:27] Maybe restore the first 100 for now, and sort the rest out later? 
[17:50:06] that might make it even more messy, heh [17:50:36] Could try it again, but I don't want to interfere with other people's try ... [17:51:00] jesus is there no paging on Special:undelete! [17:51:26] the undelete page barely loads for me [17:51:26] my attempts to just restore all edits resulted in either Service Temporarily Unavailable or [XJEnTQpAAD0AAAkBO8gAAADF] 2019-03-19 17:31:09: Critical exception of type "Wikimedia\Rdbms\DBQueryError" [17:51:41] let's try shell [17:52:28] bawolff: Huh? [17:52:49] It tries to load all the revision entries on one page [17:52:57] Cladis: That error is "Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.48.15)" [17:53:48] so it is a pure db timeout? heh [17:53:54] not a timeout [17:54:13] took too long [17:54:21] it is something trying to get an exclusive lock on the same rows [17:54:50] (03PS1) 10Ladsgroup: Add wikimania as a special group to wikidata sitelinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497578 (https://phabricator.wikimedia.org/T217730) [17:54:55] Probably not helped with multiple people trying to do it... [17:55:04] indeed [17:55:08] yeah <_< [17:55:09] Expectation (readQueryTime <= 5) by MediaWiki::main not met (actual: 15.582684993744) [17:55:26] So it could be the multiple people caused some sort of deadlock [17:55:35] or maybe its just slow because its 20,000 revisions [17:55:40] Or both [17:55:48] probably both [17:55:52] (03Abandoned) 10EddieGP: Add note: most stuff here not used in cloud vps hiera [labs/private] - 10https://gerrit.wikimedia.org/r/423233 (owner: 10EddieGP) [17:56:19] !log shutdown scp1001 for uplink move [17:56:19] XioNoX: Failed to log message to wiki. Somebody should check the error logs. [17:56:20] It's approx. 19k revisions. If I would try this with my laptop, it would've died. Unfortunately I worked with my desktop computer, with much stronger hardware. 
[17:56:39] It's not really anything to do with your laptop/computer [17:57:46] reedy@deploy1001:/srv/mediawiki-staging$ mwscript undelete.php --wiki=metawiki --user="Reedy (WMF)" --reason="Accidental delete" "Steward_requests/Global_permissions" [17:57:46] Undeleting Steward_requests/Global_permissions...[Tue Mar 19 17:57:26 2019] [hphp] [21187:7fc3c22133c0:0:000001] [] [17:57:46] Fatal error: Stack overflow in /srv/mediawiki-staging/php-1.33.0-wmf.21/includes/title/MediaWikiTitleCodec.php on line 213 [17:57:48] Brilliant [17:58:03] (03Abandoned) 10EddieGP: cloud hiera: Remove unused paths from hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/423190 (owner: 10EddieGP) [17:58:07] yeah once a post request is out your hardware does not matter much [17:58:37] hah [17:58:37] (03Abandoned) 10EddieGP: Remove labs//common.yaml hiera path [labs/private] - 10https://gerrit.wikimedia.org/r/423189 (owner: 10EddieGP) [17:58:38] lol [17:58:42] umm [17:58:49] does that mean it cant be undeleted :P [17:58:59] Its just been one of those days [17:59:11] bawolff: atleast it was not a main page [17:59:28] I think we should go with expanding in batches [17:59:33] *undeleting [17:59:39] well, main page is trivial to undelete actually [17:59:44] not so many revs [17:59:50] (03Abandoned) 10EddieGP: wwwportals: De-duplicate apache vhost code [puppet] - 10https://gerrit.wikimedia.org/r/397770 (owner: 10EddieGP) [17:59:52] (03CR) 10EddieGP: "Nobody interested in this." 
[puppet] - 10https://gerrit.wikimedia.org/r/398396 (owner: 10EddieGP) [17:59:54] (03Abandoned) 10EddieGP: mediawiki: Move www.wikimedia.org to wwwportals.conf [puppet] - 10https://gerrit.wikimedia.org/r/424707 (https://phabricator.wikimedia.org/T173887) (owner: 10EddieGP) [17:59:54] https://phabricator.wikimedia.org/T218712 [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190319T1800) [18:00:15] does that script support batches or we should just select like 5k checkboxes in the UI each iteration? [18:00:19] (03Abandoned) 10EddieGP: Remove wikipedia.org vhost [puppet] - 10https://gerrit.wikimedia.org/r/398396 (owner: 10EddieGP) [18:00:21] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [18:00:26] (03PS1) 10Ladsgroup: Add wikimaniawiki to wikidataclient.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497579 (https://phabricator.wikimedia.org/T217730) [18:01:14] Script has no batching [18:01:19] !log restart pdfrender on scb1003 [18:01:19] jijiki: Failed to log message to wiki. Somebody should check the error logs. [18:01:23] $archive = new PageArchive( $title, RequestContext::getMain()->getConfig() ); [18:01:23] $this->output( "Undeleting " . $title->getPrefixedDBkey() . '...' ); [18:01:23] $archive->undelete( [], $reason ); [18:01:23] $this->output( "done\n" ); [18:01:43] Maybe just via web interface [18:01:58] !log stopping pybal on lvs1001 [18:01:58] XioNoX: Failed to log message to wiki. Somebody should check the error logs. [18:02:29] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.074 second response time https://phabricator.wikimedia.org/T174916 [18:03:17] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.102 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [18:03:21] o_O [18:03:31] wat [18:03:42] Bsadowski1: ? 
[18:03:48] Did the train run? [18:03:51] 10Operations, 10Traffic, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Remove wildcard vhost for *.wikimedia.org - https://phabricator.wikimedia.org/T192206 (10EddieGP) 05Open→03Declined [18:04:00] Niharika: Nope [18:04:04] Not quite sure what happened... [18:04:13] gerrit issues earlier.. some CI patches failing and stuff [18:04:23] PROBLEM - pybal on lvs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [18:04:33] PROBLEM - PyBal backends health check on lvs1001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [18:04:45] Ah, okay. [18:05:21] I'm guessing https://phabricator.wikimedia.org/T218694 [18:06:04] That sounds baaaaad. [18:06:11] * paladox is asking upstream about that. [18:06:31] Niharika: gerrit love their breaking changes, unfortunately [18:06:55] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [18:07:19] Reedy i doin't think that's expected. 
[18:08:01] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [18:08:25] PROBLEM - PyBal connections to etcd on lvs1001 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=4) [18:08:41] lvs1001 is under maintenance [18:09:00] 10Operations, 10ops-eqiad, 10Operations-Software-Development, 10monitoring, 10Patch-For-Review: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10jijiki) p:05Triage→03Normal [18:09:23] PROBLEM - PyBal backends health check on lvs1001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [18:09:26] 10Operations, 10ops-codfw: Degraded RAID on labtestservices2002 - https://phabricator.wikimedia.org/T218405 (10jijiki) p:05Triage→03Normal [18:09:50] !log starting pybal on lvs1001 [18:09:50] XioNoX: Failed to log message to wiki. Somebody should check the error logs. [18:10:23] RECOVERY - pybal on lvs1001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [18:10:25] 10Operations, 10ops-codfw: Degraded RAID on labtestcontrol2003 - https://phabricator.wikimedia.org/T218403 (10jijiki) p:05Triage→03Normal [18:10:37] RECOVERY - PyBal backends health check on lvs1001 is OK: PYBAL OK - All pools are healthy [18:10:44] so where do we stand with Steward requests/Global permissions? [18:11:04] 10Operations, 10Gerrit, 10Phabricator, 10Security-Team: Add gerrit.wikimedia.org to the Phabricator CSP - https://phabricator.wikimedia.org/T218308 (10jijiki) p:05Triage→03Normal [18:11:15] I restored the last like 15 revisions [18:11:18] I see bawolff restored a couple of edits [18:11:19] yeah [18:11:30] !log stopping pybal on lvs1002 [18:11:31] XioNoX: Failed to log message to wiki. Somebody should check the error logs. [18:11:32] shall I proceed just restoring via ui? 
[18:11:43] Yeah i think so [18:11:50] In like batches of 100 or something [18:11:55] or maybe even a 1000 [18:12:14] ok, let's do this [18:12:34] when someone has a minute can they let me know what's happening? (sorry, connectivty lost for hours) [18:13:05] apergos: A page on meta can't be undeleted due to stack overflow & db lock timeout [18:13:25] ah ha [18:13:37] are we sure 2.15.11 introduced the regression? Im not finding the change that did it in https://github.com/GerritCodeReview/gerrit/compare/v2.15.8...v2.15.12 [18:13:37] RECOVERY - PyBal connections to etcd on lvs1001 is OK: OK: 4 connections established with conf1004.eqiad.wmnet:4001 (min=4) [18:13:57] How the hell there is a stack overflow in TitleCodec, I don't know... [18:14:03] PROBLEM - pybal on lvs1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [18:14:05] PROBLEM - PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [18:15:07] 10Operations, 10Operations-Software-Development, 10Goal: Expand Netbox usage - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205868 (10crusnov) [18:15:10] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (10crusnov) 05Open→03Resolved [18:17:47] !log starting pybal on lvs1002 [18:17:47] XioNoX: Failed to log message to wiki. Somebody should check the error logs. 
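The batching idea discussed above (restoring revisions through the UI in chunks of 100 or 1000 rather than all at once) can be sketched roughly as follows. This is only an illustration in Python, not the undelete.php script or the web interface: `restore_batch` is a hypothetical callback standing in for whatever actually restores a set of revisions, and the revision IDs and batch size are made up.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def restore_in_batches(
    revision_ids: Sequence[int],
    size: int,
    restore_batch: Callable[[List[int]], None],  # hypothetical: restores one chunk
) -> int:
    """Call `restore_batch` once per chunk and return the number of chunks."""
    batches = 0
    for batch in chunked(revision_ids, size):
        restore_batch(batch)
        batches += 1
    return batches
```

Keeping each chunk small bounds the work done per request, which is presumably why a smaller batch was suggested after the all-at-once attempt overflowed the stack.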
[18:18:11] (03PS4) 10GTirloni: openstack - Convert cron jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) [18:18:18] RECOVERY - pybal on lvs1002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [18:18:18] RECOVERY - PyBal backends health check on lvs1002 is OK: PYBAL OK - All pools are healthy [18:18:38] (03PS5) 10GTirloni: openstack - Convert cron jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) [18:19:09] (03CR) 10GTirloni: openstack - Convert cron jobs to systemd timers (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [18:19:39] (03CR) 10jerkins-bot: [V: 04-1] openstack - Convert cron jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [18:19:51] 10Operations, 10CX-cxserver, 10Citoid, 10Graphoid, and 10 others: Make services swagger specs standard compliant - https://phabricator.wikimedia.org/T218217 (10jijiki) p:05Triage→03Normal [18:20:11] 10Operations, 10Performance-Team (Radar): PHP fatal error handler not working on mwdebug servers - https://phabricator.wikimedia.org/T217846 (10jijiki) p:05Triage→03Normal [18:21:49] trying to restore top 5k edits via ui [18:21:56] 10Operations, 10Gerrit, 10Patch-For-Review: Convert Gerrit to use H2 as the database after 2.16 upgrade - https://phabricator.wikimedia.org/T211139 (10Paladox) p:05Normal→03High Per T200739#5034407 [18:22:59] 10Operations, 10Gerrit, 10serviceops: Gerrit loads very slowly - https://phabricator.wikimedia.org/T215855 (10Paladox) 05Open→03Resolved Thank you @hashar. I think this issue is resolved for now. 
[18:23:08] succeeded will process on [18:24:18] 10Operations, 10Gerrit, 10serviceops, 10Patch-For-Review: Convert Gerrit to use H2 as the database - https://phabricator.wikimedia.org/T211139 (10Paladox) [18:24:32] PROBLEM - Host labsdb1009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:26:07] (03PS6) 10GTirloni: openstack - Convert cron jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) [18:27:08] PROBLEM - Host analytics1058 is DOWN: PING CRITICAL - Packet loss = 100% [18:27:11] (03CR) 10jerkins-bot: [V: 04-1] openstack - Convert cron jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [18:27:30] 2nd batch success, third resulted in [XJE0SQpAAEUAAHEez6UAAABE] 2019-03-19 18:27:02: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" [18:27:33] gonna retry [18:29:08] RECOVERY - Host analytics1058 is UP: PING OK - Packet loss = 0%, RTA = 36.20 ms [18:32:20] SRG restored. now gonna make sure that revdeleted and suppressed revs are still that [18:33:31] (03PS4) 10Bstorm: wiki replicas: Remove reference to old comment fields [puppet] - 10https://gerrit.wikimedia.org/r/494999 (https://phabricator.wikimedia.org/T212972) [18:35:27] seems to be fine [18:36:50] (03PS7) 10GTirloni: openstack - Convert cron jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) [18:40:10] (03CR) 10GTirloni: "@arturo - I moved long lines into a separate variable but it didn't help much. The lines are still long. 
I tried breaking the with "\" lik" [puppet] - 10https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [18:43:39] (03CR) 10Bstorm: [C: 03+2] wiki replicas: Remove reference to old comment fields [puppet] - 10https://gerrit.wikimedia.org/r/494999 (https://phabricator.wikimedia.org/T212972) (owner: 10Bstorm) [18:46:08] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi) [18:46:26] (03PS3) 10ArielGlenn: fix up a bunch of global config values for dumps [puppet] - 10https://gerrit.wikimedia.org/r/497305 [18:46:33] !log failover cr2-eqiad:ae1 VRRP master to cr1 [18:46:33] XioNoX: Failed to log message to wiki. Somebody should check the error logs. [18:51:12] (03PS5) 10Jbond: debdeploy: add config to filter out services [puppet] - 10https://gerrit.wikimedia.org/r/497481 [18:51:59] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:52:18] ! [remote rejected] HEAD -> refs/publish/production/497430 (change 497430 missing revisions) [18:52:27] when trying to push to https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/497430/ [18:52:27] (03PS4) 10ArielGlenn: fix up a bunch of global config values for dumps [puppet] - 10https://gerrit.wikimedia.org/r/497305 [18:52:30] (03CR) 10jerkins-bot: [V: 04-1] debdeploy: add config to filter out services [puppet] - 10https://gerrit.wikimedia.org/r/497481 (owner: 10Jbond) [18:53:03] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:53:18] even if I push origin HEAD:refs/for/production [18:53:20] vgutierrez, ^ [18:55:01] Hmm, maybe a side effect of the cleanup, I bet [18:55:53] I'm not sure that one was ever vandalised though [18:56:08] (03PS5) 10ArielGlenn: fix up a bunch of global config values for dumps [puppet] - 10https://gerrit.wikimedia.org/r/497305 [18:57:11] ah, it was [18:58:04] sounds like the cleanup needs to be cleaned up [18:59:54] I expect no one tried a push on a cleaned up commit, but I was gone for some hours with connection troubles, so can't be sure [19:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190319T1900) [19:01:12] I'll make a ticket. [19:02:01] thanks [19:02:33] !log disable cr2-eqiad:ae1 [19:02:34] XioNoX: Failed to log message to wiki. Somebody should check the error logs. [19:03:39] (03CR) 10ArielGlenn: [C: 03+2] fix up a bunch of global config values for dumps [puppet] - 10https://gerrit.wikimedia.org/r/497305 (owner: 10ArielGlenn) [19:04:21] RECOVERY - Disk space on labmon1001 is OK: DISK OK [19:05:00] (03CR) 10Alex Monk: "I have a new PS for this change but see T218723" [puppet] - 10https://gerrit.wikimedia.org/r/497430 (owner: 10Alex Monk) [19:05:13] (03PS5) 10Arturo Borrero Gonzalez: wmcs: decommision several codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/497293 (https://phabricator.wikimedia.org/T218023) [19:05:43] (03CR) 10Alex Monk: [C: 04-2] "needs reuploading to acme-chief.git" [software/certcentral] - 10https://gerrit.wikimedia.org/r/460397 (https://phabricator.wikimedia.org/T207374) (owner: 10Alex Monk) [19:09:20] !log rebooted labmon1001 [19:09:20] gtirloni: Failed to log message to wiki. Somebody should check the error logs. 
[19:11:19] PROBLEM - Host labmon1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:11:52] and paged [19:11:54] gtirloni: that paged :-P better to downtime stuff beforehand [19:12:27] only the oncall received it right? ;) [19:12:31] sorry about the noise [19:13:15] ACKNOWLEDGEMENT - Host labmon1001 is DOWN: PING CRITICAL - Packet loss = 100% Arturo Borrero Gonzalez rebooting [19:14:20] RECOVERY - Host labmon1001 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [19:14:54] for europe some people will still be in the call window... anyways next time [19:15:08] gtirloni: define oncall :D [19:17:49] Krenair: thanks for reporting that, I've tried something, could you retry again please? (re: 497430) [19:18:09] (03PS3) 10Alex Monk: acme_chief: Add security::access::config on passive host if realm == labs [puppet] - 10https://gerrit.wikimedia.org/r/497430 [19:18:15] volans, yay, thanks [19:18:29] great, thank you :) [19:18:40] did you reindex the project too? [19:18:49] *change [19:18:54] ticket is https://phabricator.wikimedia.org/T218723 btw [19:19:05] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: Add security::access::config on passive host if realm == labs [puppet] - 10https://gerrit.wikimedia.org/r/497430 (owner: 10Alex Monk) [19:19:18] classic jenkins [19:19:30] good to know [19:20:55] (03PS4) 10Alex Monk: acme_chief: Add security::access::config on passive host if realm == labs [puppet] - 10https://gerrit.wikimedia.org/r/497430 [19:21:48] everything should have been reindexed [19:22:02] at least, commands were run, presumably nothing got missed [19:23:04] (03PS1) 10Herron: rsyslog: add omkafka load statement to udp_json_logback config [puppet] - 10https://gerrit.wikimedia.org/r/497587 (https://phabricator.wikimedia.org/T213899) [19:23:42] (03PS1) 10Jcrespo: Revert "db-eqiad.php: Depool hosts in row A" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497589 [19:23:57] (03CR) 10Jcrespo: [C: 04-2] "No green light yet." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/497589 (owner: 10Jcrespo) [19:25:54] (03PS1) 10Alex Monk: openstack proxyleaks: Rm check for old proxy-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/497590 [19:27:04] anyone planning to do the mediawiki release train in this window? if not, I will be performing a failover between Icinga hosts [19:27:06] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1023 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [19:28:25] self recovered --^ [19:28:37] looks like self-recovered an hour ago [19:29:37] (03PS1) 10CDanis: icinga: failover to icinga1001 [dns] - 10https://gerrit.wikimedia.org/r/497591 [19:29:48] !log ariel@deploy1001 Started deploy [dumps/dumps@da66149]: move maxretries to config [19:29:48] ariel@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [19:29:51] !log ariel@deploy1001 Finished deploy [dumps/dumps@da66149]: move maxretries to config (duration: 00m 03s) [19:29:52] ariel@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [19:31:27] (03PS1) 10CDanis: icinga: failover to icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/497592 [19:35:47] !log enable cr2-eqiad:ae1 [19:35:47] XioNoX: Failed to log message to wiki. Somebody should check the error logs. [19:36:54] (03CR) 10CDanis: [C: 03+2] icinga: failover to icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/497592 (owner: 10CDanis) [19:36:59] !log failing over icinga to icinga1001 [19:36:59] cdanis: Failed to log message to wiki. Somebody should check the error logs. [19:38:52] (03CR) 10CDanis: [C: 03+2] icinga: failover to icinga1001 [dns] - 10https://gerrit.wikimedia.org/r/497591 (owner: 10CDanis) [19:42:54] nice! 
[19:43:00] metamon worked :D [19:43:00] see the email cdanis ;) [19:43:10] puppet is running on icinga1001 right now btw [19:43:26] !log remove forced failover on cr1/cr2-eqiad [19:43:27] XioNoX: Failed to log message to wiki. Somebody should check the error logs. [19:43:50] (03PS6) 10Jbond: debdeploy: add config to filter out services [puppet] - 10https://gerrit.wikimedia.org/r/497481 [19:44:59] !log icinga failed over to icinga1001 successfully [19:44:59] cdanis: Failed to log message to wiki. Somebody should check the error logs. [19:45:13] (03CR) 10jerkins-bot: [V: 04-1] debdeploy: add config to filter out services [puppet] - 10https://gerrit.wikimedia.org/r/497481 (owner: 10Jbond) [19:46:16] (03PS4) 10ArielGlenn: dumps: set up a minimal config file for 'other' dumps [puppet] - 10https://gerrit.wikimedia.org/r/463711 (https://phabricator.wikimedia.org/T205825) [19:46:24] well that was all pretty easy [19:47:18] !log disable asw2-a<->asw-a link [19:47:18] XioNoX: Failed to log message to wiki. Somebody should check the error logs. 
[19:48:55] (03PS7) 10Jbond: debdeploy: add config to filter out services [puppet] - 10https://gerrit.wikimedia.org/r/497481 [19:49:01] (03PS1) 10Hashar: Revert "puppet_alert: Email projectadmins instead of members" [puppet] - 10https://gerrit.wikimedia.org/r/497595 (https://phabricator.wikimedia.org/T218559) [19:49:13] (03PS2) 10Hashar: Revert "puppet_alert: Email projectadmins instead of members" [puppet] - 10https://gerrit.wikimedia.org/r/497595 (https://phabricator.wikimedia.org/T218559) [19:50:09] (03PS4) 10Gehel: elasticsearch: deploy elasticsearch config for ES6 [puppet] - 10https://gerrit.wikimedia.org/r/495921 (https://phabricator.wikimedia.org/T218116) [19:51:44] (03CR) 10Gehel: [C: 03+2] elasticsearch: deploy elasticsearch config for ES6 [puppet] - 10https://gerrit.wikimedia.org/r/495921 (https://phabricator.wikimedia.org/T218116) (owner: 10Gehel) [19:52:04] (03PS3) 10Hashar: Revert "puppet_alert: Email projectadmins instead of members" [puppet] - 10https://gerrit.wikimedia.org/r/497595 (https://phabricator.wikimedia.org/T218559) [19:52:46] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.e6-upgrade [19:52:46] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [19:52:53] (03CR) 10Jbond: "> Patch Set 4: Code-Review-1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/497481 (owner: 10Jbond) [19:53:12] (03CR) 10Hashar: "I had to revert it on the integration project since that broke puppet entirely. The revert is applied on integration-puppetmaster01.integ" [puppet] - 10https://gerrit.wikimedia.org/r/497595 (https://phabricator.wikimedia.org/T218559) (owner: 10Hashar) [19:54:10] !log gehel@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=99) [19:54:11] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[19:57:01] (03PS2) 10Herron: rsyslog: add omkafka load statement to udp_json_logback config [puppet] - 10https://gerrit.wikimedia.org/r/497587 (https://phabricator.wikimedia.org/T213899) [19:57:36] (03CR) 10BryanDavis: [C: 04-1] "> I had to revert it on the integration project since that broke" [puppet] - 10https://gerrit.wikimedia.org/r/497595 (https://phabricator.wikimedia.org/T218559) (owner: 10Hashar) [19:58:50] (03CR) 10Alex Monk: [C: 04-1] "I think you'll have to sort out the conflicting apt pin in modules/zuul/manifests/init.pp - the comment there does say it's basically a te" [puppet] - 10https://gerrit.wikimedia.org/r/497595 (https://phabricator.wikimedia.org/T218559) (owner: 10Hashar) [20:01:13] (03CR) 10Herron: [C: 03+2] rsyslog: add omkafka load statement to udp_json_logback config [puppet] - 10https://gerrit.wikimedia.org/r/497587 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [20:01:26] 10Operations, 10ops-eqiad, 10decommission: Decommission asw-a-eqiad - https://phabricator.wikimedia.org/T218734 (10ayounsi) p:05Triage→03Normal [20:03:00] (03PS1) 10Ayounsi: Remove asw-a-eqiad from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/497601 (https://phabricator.wikimedia.org/T218734) [20:04:04] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP, 10Patch-For-Review: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10hashar) Sorry I have lost track of this ticket since March 7th but seems it had ample activity. It is less of a... [20:04:22] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:08:10] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:08:12] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15216/" [puppet] - 10https://gerrit.wikimedia.org/r/497601 (https://phabricator.wikimedia.org/T218734) (owner: 10Ayounsi) [20:08:33] (03PS2) 10Ayounsi: Remove asw-a-eqiad from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/497601 (https://phabricator.wikimedia.org/T218734) [20:10:55] 10Operations, 10Services, 10VisualEditor, 10Readers-Web-Backlog (Tracking), 10Wikimedia-production-error: [Bug] Sporadic 503 errors when editing - https://phabricator.wikimedia.org/T218252 (10mobrovac) 05Open→03Invalid [20:11:29] (03CR) 10Jcrespo: [C: 04-2] "While this should be ok to go any time now, operations are not 100% finished at this time, so probably you can do it when you wake up." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497589 (owner: 10Jcrespo) [20:12:16] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission asw-a-eqiad - https://phabricator.wikimedia.org/T218734 (10ayounsi) [20:14:02] 10Operations, 10Services, 10VisualEditor, 10Readers-Web-Backlog (Tracking), 10Wikimedia-production-error: [Bug] Sporadic 503 errors when editing - https://phabricator.wikimedia.org/T218252 (10Krinkle) >>! In T218252#5030786, @Jdlrobson wrote: > I dont know anything about the action=visualeditor. The er... 
[20:16:18] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [20:16:53] checking that [20:17:23] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi) [20:17:26] 10Operations, 10netops: Increase network capacity (2018-19 Q3 Goal) - https://phabricator.wikimedia.org/T213122 (10ayounsi) [20:17:40] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi) 05Open→03Resolved Everything here is done, thank you all for your help! [20:17:42] (03PS1) 10Hashar: contint: update sury.org gpg key for apt [puppet] - 10https://gerrit.wikimedia.org/r/497605 (https://phabricator.wikimedia.org/T218735) [20:17:46] XioNoX: 'asw-a-eqiad' is not a valid parent for host 'contint1001' [20:17:54] icinga config is broken [20:17:58] Error: 'asw-a-eqiad' is not a valid parent for host 'contint1001' [20:18:17] yeah, probably him :-) [20:18:20] 10Operations, 10ops-eqiad, 10decommission, 10netops, 10Patch-For-Review: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10ayounsi) [20:18:40] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi) [20:18:48] sorry gotta go afk now [20:19:08] 10Operations, 10ops-eqiad, 10decommission, 10netops, 10Patch-For-Review: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10ayounsi) [20:19:29] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi) [20:19:32] 10Operations, 10netops: Increase network capacity (2018-19 Q3 Goal) - 
https://phabricator.wikimedia.org/T213122 (10ayounsi) 05Open→03Resolved [20:19:58] er, me [20:20:27] jynus: only contint1001 is complaining? [20:20:27] maybe a race condition if you are decommision [20:20:32] yep, at the moment [20:20:45] it has been moved to the new switch stack hours ago [20:20:46] that was the only error [20:20:57] that is a puppet error [20:21:00] ah [20:21:02] Puppet is disabled. incident [20:21:20] as in, something makes no sense to puppet like the host is not in icinga or something [20:21:23] so because puppet is disabled it still thinks it's on the old switch stack [20:21:30] I see [20:21:36] ... does the icinga puppet file get generated from cached puppet facts, or something? [20:21:42] s/icinga puppet/icinga config/ [20:21:44] do what you can, like removing the host [20:21:58] Is Wikidata dead? [20:22:05] so I can re-add asw-a-eqiad [20:22:05] as while it is not an immedite blocker, it blocks any new icinga check [20:22:16] yeah, will fix it [20:22:17] sjoerddebruin: WFM [20:22:40] wikidata WFM too sjoerddebruin [20:22:49] cdanis: is there a way to update contint1001 so it knows it's on the new switches? [20:22:53] sjoerddebruin, maybe step into -tech and let's figure out what's up when you try to connect [20:22:55] or puppet [20:22:58] I am trying to figure it out XioNoX [20:23:02] thx [20:25:11] it may need some puppetdb magic, difficult without running puppet [20:25:40] XioNoX: do you know if it is in use? [20:25:58] jynus: it as contint1001? [20:26:01] hashar: ^ [20:26:13] yeah it's in use [20:26:49] is there a good reason to not enable puppet? [20:27:03] and why it cannot run puppet (I don't have a problem, but if it is because some manual changes, maybe a patch can be sent) [20:27:40] can we know who disabled puppet? [20:27:50] yes, it sais it [20:28:28] normally it says who [20:28:39] unless done as sudo -i ? 
[20:28:43] there is an ack from vgutierrez in icinga for the "puppet last run" icinga check [20:28:57] XioNoX: yeah :-) [20:29:35] if it was done with sudo you could still dig through the logs and have a decent go at figuring out who [20:29:50] if someone logged in as root directly, could try to figure out based on IP [20:29:51] I have a CR almost ready to re-add asw-a-eqiad [20:29:52] there are ways [20:29:57] (03PS1) 10Ayounsi: Icinga: re-add asw-a-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/497606 (https://phabricator.wikimedia.org/T218734) [20:30:04] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Patch-For-Review: Upgrade jenkins-debian-glue to v0.20.0 - https://phabricator.wikimedia.org/T212774 (10hashar) I have cherry picked https://gerrit.wikimedia.org/r/495681/3 on the CI puppetmaster and I now have: ` $ apt-cache policy jen... [20:30:29] a —dry-run should update exported resources without actually changing anything, if the catalog compiles [20:30:30] cdanis, jynus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/497606 [20:30:45] ahhh [20:30:52] herron: worth trying :) [20:30:56] I will try that herron thanks [20:31:31] is --dry-run the same as --noop? [20:31:43] sorry about you suffering that, XioNoX clearly not your fault, but it is an important issue, I think [20:31:47] (03CR) 10Hashar: [C: 03+1] "Seems good to me. I have applied the patch on the CI puppet master, the component/ci seems to have been applied properly and jenkins-debia" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/495681 (https://phabricator.wikimedia.org/T212774) (owner: 10Jbond) [20:32:12] yeah, and I caused it :) [20:32:14] yeah my bad —noop [20:32:19] XioNoX: not really [20:32:29] the problem was puppet [20:32:44] I ran this on contint1001: puppet agent --noop --test --debug [20:32:49] thanks for the help anyway :) [20:32:49] but do not see any new facts on puppetboard [20:33:45] enable && noop; disable, but I guess risky? 
[20:33:48] lldp_parent still says "asw-a-eqiad" [20:35:20] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Patch-For-Review: Upgrade jenkins-debian-glue to v0.20.0 - https://phabricator.wikimedia.org/T212774 (10hashar) Pending the puppet patch to be merged. Otherwise it is essentially solved :] [20:36:03] cdanis: got paged by icinga2001... [20:36:08] anything under control? :) [20:36:16] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Upgrade jenkins-debian-glue to v0.20.0 - https://phabricator.wikimedia.org/T212774 (10hashar) a:03hashar [20:36:20] s/anything/everything/ [20:36:34] so vgutierrez disabled puppet on contint1001 this morning [20:36:44] since then XioNoX has made networking topology changes [20:37:07] but since puppet has been disabled on that host for so long, it still has the old lldp_parent fact in puppetdb [20:37:13] yes that's for the configuration, but why icinga2001 think it his a master? [20:37:22] uhh [20:37:23] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15217/" [puppet] - 10https://gerrit.wikimedia.org/r/497606 (https://phabricator.wikimedia.org/T218734) (owner: 10Ayounsi) [20:39:18] volans: I am going to guess something about sync_icinga_state execution when the master has bad configuration files [20:39:58] as that is the last thing I see happening on icinga2001 [20:40:05] close enough timing wise [20:40:18] I have https://gerrit.wikimedia.org/r/c/operations/puppet/+/497606 ready to be merged if no better solution [20:40:56] and from https://puppetboard.wikimedia.org/fact/lldp_parent/asw-a-eqiad only contint1001 have the issue [20:41:12] why is puppet disabled? 
[20:41:17] 10Operations, 10Beta-Cluster-Infrastructure, 10HHVM, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): hhvm systemd service on deployment-prep reports: hhvm.service: Ignoring invalid environment assignment 'RUN_AS_GROUP=www-data - https://phabricator.wikimedia.org/T209946 (10hashar) a:05hashar→... [20:41:55] maybe was just because failing due to gerrit down? or something else? [20:42:03] no idea [20:42:17] XioNoX: yeah sorry was busy with other duties [20:42:27] what is going on with contint1001? [20:42:36] hashar: any idea why puppet is disabled? [20:42:46] not sure [20:42:59] maybe puppet tries to start zuul automagically when we wanted to keep it down [20:43:13] if there is no message in puppet nor in sal ... I guess we can just bring it back up [20:43:40] I can !log it if you want to take the blame if need be ;D [20:44:57] !log enabling puppet on contint1001 [20:44:57] cdanis: Failed to log message to wiki. Somebody should check the error logs. [20:45:10] got disabled by a.kosiaris this morning [20:45:15] so yeah can just enable it again [20:45:43] (found out via timestamp of /var/lib/puppet/state/agent_disabled.lock and comparing with output of `last` ) [20:46:32] Applied catalog ! [20:46:39] ok XioNoX seems like there is a new lldp_parent now [20:46:43] yay [20:46:50] is icinga happy? [20:47:00] I am going to re-run puppet on those hosts now [20:47:05] https://www.irccloud.com/pastebin/SjULl5KX/ [20:47:08] that look right? [20:49:20] Things look okay - No serious problems were detected during the pre-flight check [20:49:24] on icinga1001 [20:49:31] hashar: yeah [20:49:38] volans: I am going to run sync_icinga_state on icinga2001 and see what happens, ok? [20:49:39] er [20:49:53] herron: ish, seems like this facter should be re-written [20:49:58] cdanis: +1 [20:50:37] ahh [20:50:42] ... interesting... 
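The detective work mentioned at 20:45:43 (taking the timestamp of /var/lib/puppet/state/agent_disabled.lock and comparing it with the output of `last`) might look something like the sketch below. It is an assumption-laden illustration, not what was actually run: parsing `last` is left out, and the `Session` tuples are a hypothetical pre-parsed form of its output.

```python
import os
from datetime import datetime, timezone
from typing import List, Optional, Tuple

# Hypothetical pre-parsed login record: (user, login time, logout time or None if still logged in)
Session = Tuple[str, datetime, Optional[datetime]]


def lock_disabled_at(path: str) -> datetime:
    """Return the lock file's modification time as an aware UTC datetime."""
    return datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)


def users_logged_in_at(when: datetime, sessions: List[Session]) -> List[str]:
    """Return the sorted set of users whose session covers the moment `when`."""
    return sorted(
        {
            user
            for user, login, logout in sessions
            if login <= when and (logout is None or when <= logout)
        }
    )
```

If several people were logged in at that moment the result is only a shortlist, so it narrows down who disabled puppet rather than proving it.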
[20:51:54] (03PS4) 10MSantos: Pass flag use_nodejs10 for maps services [puppet] - 10https://gerrit.wikimedia.org/r/495735 (https://phabricator.wikimedia.org/T215523) [20:53:13] volans: so, sync_icinga_state stops the icinga service, does an rsync of all the state files, and then starts it back up [20:53:23] volans: what will happen if the icinga service fails to start because the config files are bad? :D [20:57:10] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [20:57:10] (03PS1) 10Ottomata: eventgate-analytics - adjustments to statsd exporter matches [deployment-charts] - 10https://gerrit.wikimedia.org/r/497612 (https://phabricator.wikimedia.org/T218305) [20:57:43] volans, still about? [20:58:15] !log disable down ports with no description on switches [20:58:16] XioNoX: Failed to log message to wiki. Somebody should check the error logs. [20:58:36] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - adjustments to statsd exporter matches [deployment-charts] - 10https://gerrit.wikimedia.org/r/497612 (https://phabricator.wikimedia.org/T218305) (owner: 10Ottomata) [20:58:38] volans: the answer is that the CGI on icinga2001 will happily read and serve the copied-verbatim-state from icinga1001, which were written out by an icinga that was the active icinga, with notifications enabled, etc [21:05:51] (03PS1) 10Herron: ores: ship to logstash via the kafka logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/497614 (https://phabricator.wikimedia.org/T213899) [21:06:34] (03CR) 10jerkins-bot: [V: 04-1] ores: ship to logstash via the kafka logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/497614 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [21:06:45] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging]
[21:06:45] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [21:06:46] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [21:06:46] !log otto@deploy1001 scap-helm eventgate-analytics finished [21:06:46] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [21:06:46] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [21:07:07] !log cdanis@wikitech-static.wikimedia.org: apt install sshguard [21:07:07] cdanis: Failed to log message to wiki. Somebody should check the error logs. [21:07:30] (03PS2) 10Herron: ores: ship to logstash via the kafka logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/497614 (https://phabricator.wikimedia.org/T213899) [21:09:02] I'll try cutting the branch now, let me know if there's a problem with that [21:22:14] (03Abandoned) 10Ayounsi: Icinga: re-add asw-a-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/497606 (https://phabricator.wikimedia.org/T218734) (owner: 10Ayounsi) [21:23:41] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [21:23:42] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [21:23:43] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [21:23:44] !log otto@deploy1001 scap-helm eventgate-analytics finished [21:23:44] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [21:23:44] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[21:33:00] 10Operations, 10Beta-Cluster-Infrastructure, 10HHVM, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): hhvm systemd service on deployment-prep reports: hhvm.service: Ignoring invalid environment assignment 'RUN_AS_GROUP=www-data - https://phabricator.wikimedia.org/T209946 (10jijiki) p:05Triage→... [21:33:05] (03PS3) 10Herron: ores: ship to logstash via the kafka logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/497614 (https://phabricator.wikimedia.org/T213899) [21:33:33] (03PS1) 10Ottomata: eventgate-analytics Fix misplaced '_histogram' metric suffix [deployment-charts] - 10https://gerrit.wikimedia.org/r/497645 (https://phabricator.wikimedia.org/T218305) [21:34:07] (03PS2) 10Ottomata: eventgate-analytics Fix misplaced '_histogram' metric suffix [deployment-charts] - 10https://gerrit.wikimedia.org/r/497645 (https://phabricator.wikimedia.org/T218305) [21:34:44] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics Fix misplaced '_histogram' metric suffix [deployment-charts] - 10https://gerrit.wikimedia.org/r/497645 (https://phabricator.wikimedia.org/T218305) (owner: 10Ottomata) [21:36:32] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [21:36:32] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [21:36:34] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [21:36:34] !log otto@deploy1001 scap-helm eventgate-analytics finished [21:36:34] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [21:36:34] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[21:40:04] (03PS4) 10Herron: ores: ship to logstash via the kafka logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/497614 (https://phabricator.wikimedia.org/T213899) [21:40:24] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Grant root on MediaWiki maintenance hosts to perf-roots - https://phabricator.wikimedia.org/T217813 (10kchapman) This is approved. Thanks! [21:46:29] (03PS5) 10Herron: ores: ship to logstash via the kafka logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/497614 (https://phabricator.wikimedia.org/T213899) [21:50:36] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics-codfw-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: codfw] [21:50:36] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [21:50:38] !log otto@deploy1001 scap-helm eventgate-analytics cluster codfw completed [21:50:39] !log otto@deploy1001 scap-helm eventgate-analytics finished [21:50:39] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [21:50:39] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [21:50:42] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics-eqiad-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad] [21:50:42] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [21:50:44] !log otto@deploy1001 scap-helm eventgate-analytics cluster eqiad completed [21:50:44] !log otto@deploy1001 scap-helm eventgate-analytics finished [21:50:45] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [21:50:45] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [21:53:40] cdanis: ack so icinga didn't start on 2001 after the sync because of the configuration issue but the CGI happily replied? 
[21:53:58] because when I tested the script if icinga was down I was not getting data IIRC [21:54:31] Krenair: not really, but shoot [21:59:23] volans, I was wondering if there was a way for me to obtain a list of which prod server runs which distro [21:59:26] I assume this would be part of the facts uploaded to the puppet-compiler running inside labs [21:59:28] this would be to inform choices of distributions used in deployment-prep [22:05:29] Come the thought of it, is it all available through the dhcpd installer lines in puppet? [22:07:44] (03CR) 10Jdlrobson: Enable Advanced Mobile Contributions mode for ar,id,es and test wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497500 (owner: 10Pmiazga) [22:08:52] there are a lot of hosts in there without installer lines :/ [22:13:32] ? no they should be all there, those without a specific line are stretch that is the default [22:15:07] ah [22:15:21] a specific line overriding the distro, that is [22:16:03] thanks [22:16:07] but if you need specific cluster is probably better to ask the owners as they might have plan to upgrade soon maybe and will be nice to test those in depl-prep before [22:16:21] true [22:16:46] in general most of the fleet is stretch [22:20:11] !log otto@deploy1001 Started deploy [eventlogging/analytics@9aea626]: fix for production error where mw api is returning html instead of json schemas [22:20:11] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [22:20:15] !log otto@deploy1001 Finished deploy [eventlogging/analytics@9aea626]: fix for production error where mw api is returning html instead of json schemas (duration: 00m 04s) [22:20:15] otto@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [22:21:35] volans, are you aware of any in particular with plans to try out buster? 
[22:21:51] I know acme-chief is already there, analytics has got a couple of buster things and there's a test DB host on buster [22:23:03] Krenair: not in particular, I guess that between now and the official release some more test hosts/early adopters will happen but not too many, while after the release the upgrades will be more frequent. But this is all me guessing ;) [22:23:11] ok [22:23:14] thanks for your help [22:23:22] 10Operations, 10Data-Services, 10decommission, 10cloud-services-team (Kanban): Reclaim/Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet as soon as they are ready - https://phabricator.wikimedia.org/T216749 (10Bstorm) This should only be stalled on T216441 at this point. @jcrespo and @Maroste... [22:23:30] yw [22:33:30] (03PS1) 10Bstorm: wiki replicas: depool labsdb1010 for comment refactor changes [puppet] - 10https://gerrit.wikimedia.org/r/497652 (https://phabricator.wikimedia.org/T212972) [22:45:55] (03CR) 10Bstorm: [C: 03+2] wiki replicas: depool labsdb1010 for comment refactor changes [puppet] - 10https://gerrit.wikimedia.org/r/497652 (https://phabricator.wikimedia.org/T212972) (owner: 10Bstorm) [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190319T2300). Please do the needful. [23:00:04] Ebe123: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:11] I'm here! [23:01:12] Ready with https://gerrit.wikimedia.org/r/#/c/497433/ [23:01:28] could you please wait with the swat, we're trying to cut the branch? 
[23:01:31] o/ [23:01:41] I also have a patch, I can wait [23:01:45] I'm not sure if something might go wrong cc thcipriani [23:01:52] also whoever's swatting, I have a patch and can also wait [23:02:18] Odd jouncebot only pinged me... [23:02:47] it needed refresh, it's fine :D [23:02:59] I can wait; but the config fix is important :) [23:03:31] zeljkof: Oh so the train is finally running? Great! [23:03:58] Niharika: hopefully, still stuck at the very first step :/ [23:04:19] zeljkof: Thanks for working on it. Much appreciated. [23:09:02] (03CR) 10Smalyshev: [C: 03+1] Enable WikibaseCirrusSearch on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490648 (https://phabricator.wikimedia.org/T215684) (owner: 10Jforrester) [23:10:49] (03CR) 10Smalyshev: [C: 03+1] Enable WikibaseCirrusSearch on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490649 (https://phabricator.wikimedia.org/T215684) (owner: 10Jforrester) [23:11:35] (03PS1) 10CRusnov: Create a spicerack module for accessing RAPI [software/spicerack] - 10https://gerrit.wikimedia.org/r/497656 [23:15:40] (03CR) 10jerkins-bot: [V: 04-1] Create a spicerack module for accessing RAPI [software/spicerack] - 10https://gerrit.wikimedia.org/r/497656 (owner: 10CRusnov) [23:18:30] Does anyone know why it's "jerkins" bot and not "jenkins" here? [23:18:32] ^ [23:18:58] Because of the -1 [23:19:07] Amir1: Welcome to the club of people that have spotted this [23:19:11] smart [23:19:13] :P [23:19:17] aka "he's a jerk because he said no" [23:19:40] Bad jenkins [23:20:55] who is doing the SWAT today btw? [23:20:55] yeah somewhere in the wikibugs source is a thing that detects CR-1 votes given by jenkins-bot and overrides the username [23:20:56] :D [23:21:13] er, V-1 [23:25:39] If there's no one, I can do it [23:25:49] but let's see if it works first [23:27:54] Amir1: zeljkof: Reedy: ^ just wondering whom to bug about the patch I threw on the SWAT at the last minute... thx! 
[23:28:03] no rush also [23:28:26] I'm not swatting :P... No idea if it's going ahead. Not sure what's happening with the train either [23:28:36] oh hmmm [23:28:41] train is still ongoing :/ probably no swat right now [23:28:56] zeljkof: ok thx! [23:29:43] If the problem is with branching etc... I don't quite see why SWAT stuff can't go ahead [23:30:43] It's being (re)branched right now on the deployment host, that's why. :-) [23:31:16] top - 23:31:13 up 195 days, 12:20, 9 users, load average: 0.05, 0.11, 0.10 [23:31:19] It's prettyy idle [23:33:52] I leave for now. If things worked out, ignore mine until I get home, if not, I can do it later [23:40:45] we had much trouble with cutting the branch today; it seems to be working now, if experienced deployers think it's ok to swat, go ahead [23:41:09] it's past midnight here, so I can't swat, I would turn into a gremlin