[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Evening SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190301T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:00:36] yeah
[00:00:52] it might be interesting to do that check in a very general way across a lot of the fleet that's confd-controlled
[00:01:06] is everything with logging back to normal? is it fine to run a noop deploy during this window? (testing scap feature)
[00:01:50] (if a service on host X is depooled in confd, raise an icinga critical. if someone's working on the box they should've icinga-disabled it anyways, and it provides a feedback when you go check icinga at the end of your work that "oh yeah I need to repool that")
[00:02:16] we could probably write and deploy that check in a very profile-neutral way
[00:02:40] alright, updated https://wikitech.wikimedia.org/wiki/Service_restarts#Cache_proxies_%28varnish%29_%28cp%29 so I don't forget in the future
[00:03:12] bblack: yeah good point
[00:04:59] thanks!
[00:05:09] (CR) Smalyshev: "@Jforrester: I think it's ok on Beta, can we move forward with this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/489598 (https://phabricator.wikimedia.org/T217276) (owner: Smalyshev)
[00:06:50] looking at scrollback, seems like everything wrt deployment was figured out
[00:06:58] * thcipriani does scap fiddling
[00:09:56] !log thcipriani@deploy1001 Synchronized README: noop sync to test opcache-manager in scap 3.9.1-1 (duration: 00m 48s)
[00:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:27] !log pre-configure asw-a3 ports on asw2-a3-eqiad - T187960
[00:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:32] T187960: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960
[00:13:06] PROBLEM - Long running screen/tmux on analytics-tool1003 is CRITICAL: CRIT: Long running SCREEN process. (user: nuria PID: 10608, 1737636s 1728000s).
[00:19:26] Operations, ops-eqiad, netops, Patch-For-Review: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (ayounsi)
[00:39:30] (PS7) CRusnov: Add ganeti->netbox sync script [software/netbox-deploy] - https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229)
[01:15:22] (PS8) CRusnov: Add ganeti->netbox sync script [software/netbox-deploy] - https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229)
[02:16:38] (CR) Krinkle: [C: +1] Oversample navtiming on ruwiki and eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/493055 (https://phabricator.wikimedia.org/T187299) (owner: Gilles)
[02:54:17] (PS1) Paladox: Merge branch 'stable-2.14' into stable-2.15 [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/493636
[03:15:34] (CR) Paladox: [V: +2 C: +2] "Tested locally and works with bazel 0.23" [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/493636 (owner: Paladox)
[03:33:00] (Abandoned) CRusnov: Update to upstream v2.5.7 tag. [software/netbox-deploy] - https://gerrit.wikimedia.org/r/492577 (owner: CRusnov)
[03:33:14] (PS1) CRusnov: Update to upstream v2.5.7 tag. [software/netbox-deploy] - https://gerrit.wikimedia.org/r/493637
[04:00:48] (PS9) CRusnov: Add ganeti->netbox sync script [software/netbox-deploy] - https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229)
[04:01:53] (CR) CRusnov: "note that this has been successfully tested with the -i flag and a json dump from the ganeti api on the af-netbox instance." [software/netbox-deploy] - https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229) (owner: CRusnov)
[04:05:01] (PS2) CRusnov: Add dummy netbox tokens [labs/private] - https://gerrit.wikimedia.org/r/493084
[04:05:35] (CR) CRusnov: [V: +2 C: +2] Add dummy netbox tokens [labs/private] - https://gerrit.wikimedia.org/r/493084 (owner: CRusnov)
[04:51:52] PROBLEM - Long running screen/tmux on an-coord1001 is CRITICAL: CRIT: Long running SCREEN process. (user: otto PID: 26051, 3932719s 1728000s).
[05:51:22] Operations, ops-codfw, DBA: Degraded RAID on db2033 - https://phabricator.wikimedia.org/T217301 (Marostegui) Open→Resolved Thank you! It looks good now ` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physicaldrive 1I:1:2...
[05:54:13] (PS1) Marostegui: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/493646
[05:57:24] Operations, ops-eqiad, DBA, Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (Marostegui) No problem! let's leave the loop there for a few days to see if it crashes Thank you!
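The depool check floated at 00:00:52–00:02:16 (raise an Icinga CRITICAL when a confd/conftool service on a host is depooled) could look roughly like the sketch below. This is a hypothetical illustration, not the real conftool API: it assumes the pool state has already been fetched (e.g. parsed from confctl JSON output) into a plain service → state mapping, and only shows the Nagios-plugin-convention decision logic.

```python
#!/usr/bin/env python3
"""Sketch of the depooled-service Icinga check discussed above.

Hypothetical: assumes the host's pool state is available as a
service -> state mapping (e.g. parsed from confctl JSON output);
the function name and input shape are illustrative only.
"""

# Nagios plugin exit codes
OK, CRITICAL = 0, 2

def check_depooled(states):
    """Return (exit_code, message) in Nagios plugin convention.

    states: dict mapping service name to its pooled state string,
            e.g. {"nginx": "yes", "varnish-fe": "no"}.
    """
    depooled = sorted(svc for svc, pooled in states.items() if pooled != "yes")
    if depooled:
        # Anything depooled is CRITICAL: either someone forgot to
        # repool, or they should have icinga-disabled the host.
        return CRITICAL, "CRITICAL: depooled in confd: " + ", ".join(depooled)
    return OK, "OK: all services pooled"

if __name__ == "__main__":
    import sys
    code, msg = check_depooled({"nginx": "yes", "varnish-fe": "no"})
    print(msg)
    sys.exit(code)
```

Being "profile-neutral", as suggested above, would mean deriving the service list from whatever conftool objects reference the host rather than hardcoding it per role.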
[05:57:50] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/493646 (owner: Marostegui)
[05:58:52] (Merged) jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/493646 (owner: Marostegui)
[05:59:58] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1094 (duration: 00m 51s)
[05:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:51] Operations, ops-eqiad, DBA: dbproxy1012 power supply without power - https://phabricator.wikimedia.org/T217394 (Marostegui)
[06:04:06] Operations, ops-eqiad, DBA: dbproxy1012 power supply without power - https://phabricator.wikimedia.org/T217394 (Marostegui) p:Triage→Normal
[06:05:54] (CR) jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/493646 (owner: Marostegui)
[06:10:43] (PS1) Marostegui: install_server: Remove dbstore1002 [puppet] - https://gerrit.wikimedia.org/r/493647 (https://phabricator.wikimedia.org/T216491)
[06:11:43] (CR) Marostegui: [C: +2] install_server: Remove dbstore1002 [puppet] - https://gerrit.wikimedia.org/r/493647 (https://phabricator.wikimedia.org/T216491) (owner: Marostegui)
[06:28:32] PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:30:20] PROBLEM - puppet last run on analytics1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/spark2_yarn_shuffle_jar_install]
[06:31:32] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/hhvm-needs-restart]
[06:31:32] PROBLEM - puppet last run on mw1323 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:39:54] <_joe_> this is logrotate I guess
[06:40:27] <_joe_> !log upgrading php extensions on deploy* to versions compatible with php7.2
[06:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:27] !log Stop MySQL on db1094 for mysql upgrade
[06:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:34] !log Deploy schema change on s4 codfw, lag will appear on s4 codfw - T86342
[06:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:37] T86342: Dropping page.page_no_title_convert on wmf databases - https://phabricator.wikimedia.org/T86342
[06:51:31] (PS1) Marostegui: db-eqiad.php: Slowly repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/493648
[06:54:09] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/493648 (owner: Marostegui)
[06:55:07] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/493648 (owner: Marostegui)
[06:56:10] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1094 after mysql upgrade (duration: 00m 46s)
[06:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:20] RECOVERY - puppet last run on analytics1071 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:32] RECOVERY - puppet last run on mw1323 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:57:32] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:46] RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:03:34] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/493648 (owner: Marostegui)
[07:04:36] (PS1) Marostegui: db-eqiad.php: Give more traffic to db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/493649
[07:09:53] (CR) Marostegui: [C: +2] db-eqiad.php: Give more traffic to db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/493649 (owner: Marostegui)
[07:10:58] (Merged) jenkins-bot: db-eqiad.php: Give more traffic to db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/493649 (owner: Marostegui)
[07:12:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1094 after mysql upgrade (duration: 00m 47s)
[07:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:50] (CR) jenkins-bot: db-eqiad.php: Give more traffic to db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/493649 (owner: Marostegui)
[07:18:05] (CR) Giuseppe Lavagetto: [C: +2] scap: fix php version, add php7 admin port [puppet] - https://gerrit.wikimedia.org/r/493485 (https://phabricator.wikimedia.org/T211964) (owner: Giuseppe Lavagetto)
[07:22:03] (PS2) Giuseppe Lavagetto: scap: fix php version, add php7 admin port [puppet] - https://gerrit.wikimedia.org/r/493485 (https://phabricator.wikimedia.org/T211964)
[07:23:26] <_joe_> !log installed php 7.2 compatible packages on deploy1001,2001
[07:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:58] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/493650
[07:27:45] (CR) Marostegui: [C: +2] db-eqiad.php: Increase traffic for db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/493650 (owner: Marostegui)
[07:28:48] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/493650 (owner: Marostegui)
[07:29:44] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1094 after mysql upgrade (duration: 00m 47s)
[07:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:24] <_joe_> marostegui: can I do a test deploy 1 second?
[07:30:29] sure!
[07:30:31] go ahead!
[07:31:31] !log oblivian@deploy1001 Synchronized README: Test deploy for new scap configuration (duration: 00m 46s)
[07:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:03] <_joe_> uhm
[07:37:51] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/493650 (owner: Marostegui)
[07:39:17] !log oblivian@deploy1001 Synchronized README: noop sync to test opcache-manager (duration: 00m 47s)
[07:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:44] <_joe_> marostegui: last attempt I swear
[07:44:17] !log oblivian@deploy1001 Synchronized README: Test deploy for new scap configuration (duration: 00m 48s)
[07:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:02] (PS1) Marostegui: db-eqiad.php: Fully repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/493652
[08:04:54] <_joe_> marostegui: I'm done btw
[08:05:00] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (elukey) Via HPE CLI I tried to flip /map1/config1/oemHPE_ipmi_dcmi_overlan_enable=yes but didn't work afaics..
[08:15:23] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['labsdb1012.eqiad.wmnet'] ` The log...
[08:20:07] (CR) Filippo Giunchedi: [C: +1] "LGTM, to be merged on Mon I think?" [puppet] - https://gerrit.wikimedia.org/r/493610 (https://phabricator.wikimedia.org/T200960) (owner: Herron)
[08:24:38] (CR) Filippo Giunchedi: [C: +1] "LGTM" [software/logstash/plugins] - https://gerrit.wikimedia.org/r/493460 (https://phabricator.wikimedia.org/T216993) (owner: Mathew.onipe)
[08:26:24] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[08:26:39] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[08:26:46] Operations, ops-eqiad, DBA, Patch-For-Review, User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (Marostegui)
[08:27:06] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui) p:Triage→Normal
[08:28:03] (CR) Muehlenhoff: "The old repo will eventually go away; it contained PHP packages synced from an external repository, which also rebuilds/upgrades a number " [puppet] - https://gerrit.wikimedia.org/r/493451 (https://phabricator.wikimedia.org/T216712) (owner: BryanDavis)
[08:29:57] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (elukey) The above setting seems to have done the trick, now wmf-auto-reimage works.. I got this: ` 08:27:57 | labsdb1012.eqiad.wmnet | WARNI...
[08:31:50] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['labsdb1012.eqiad.wmnet'] ` and were **ALL** successful.
[08:33:16] Operations, DBA, Patch-For-Review, Performance-Team (Radar): Increase parsercache keys TTL from 22 days back to 30 days - https://phabricator.wikimedia.org/T210992 (Marostegui) Open→Resolved
[08:38:31] (PS13) DCausse: [WIP] Add support for elasticsearch 6 [puppet] - https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196)
[08:41:43] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (elukey) a:Cmjohnson→elukey
[08:43:02] (PS5) Muehlenhoff: Create /etc/debdeploy-autorestarts.conf which lists all automated restarts [puppet] - https://gerrit.wikimedia.org/r/493401
[08:45:47] (PS1) Elukey: [WIP] Assign role labs::db::wikireplica_analytics to labsdb1012 [puppet] - https://gerrit.wikimedia.org/r/493653 (https://phabricator.wikimedia.org/T215231)
[08:52:50] !log temporarily stop prometheus instances on prometheus2004 to take a snapshot
[08:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:57] ^ will cause some UNKNOWNs in icinga
[08:53:47] (CR) Muehlenhoff: [C: +2] Create /etc/debdeploy-autorestarts.conf which lists all automated restarts [puppet] - https://gerrit.wikimedia.org/r/493401 (owner: Muehlenhoff)
[08:58:08] (PS1) Giuseppe Lavagetto: scap: fix my typos [puppet] - https://gerrit.wikimedia.org/r/493654
[09:00:03] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:00:11] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:00:21] PROBLEM - puppet last run on es2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:00:27] PROBLEM - puppet last run on elastic2039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:00:30] on it
[09:00:41] PROBLEM - puppet last run on db1117 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:00:43] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:01:01] PROBLEM - puppet last run on mw2288 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:01:03] PROBLEM - puppet last run on mw2182 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:01:05] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:01:05] PROBLEM - puppet last run on elastic2035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:01:09] PROBLEM - puppet last run on elastic2030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:01:13] (PS1) Muehlenhoff: Revert "Create /etc/debdeploy-autorestarts.conf which lists all automated restarts" [puppet] - https://gerrit.wikimedia.org/r/493655
[09:01:23] PROBLEM - puppet last run on an-worker1079 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:01:42] (CR) Muehlenhoff: [V: +2 C: +2] Revert "Create /etc/debdeploy-autorestarts.conf which lists all automated restarts" [puppet] - https://gerrit.wikimedia.org/r/493655 (owner: Muehlenhoff)
[09:01:49] PROBLEM - puppet last run on es2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:01:51] PROBLEM - puppet last run on db2042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:01:51] PROBLEM - puppet last run on mw2223 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:02:25] PROBLEM - puppet last run on db1094 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:02:27] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:02:31] PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:02:39] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:02:41] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:02:43] PROBLEM - puppet last run on ms-be1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:02:45] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:02:47] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:02:51] PROBLEM - puppet last run on wtp2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:02:53] PROBLEM - puppet last run on mc1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:02:53] PROBLEM - puppet last run on mw1309 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:03] PROBLEM - puppet last run on ms-fe2007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:03] PROBLEM - puppet last run on puppetmaster2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:03] PROBLEM - puppet last run on wtp2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:05] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:15] PROBLEM - puppet last run on lvs2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:15] PROBLEM - puppet last run on cp2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:15] PROBLEM - puppet last run on wtp2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:15] PROBLEM - puppet last run on pybal-test2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:15] PROBLEM - puppet last run on db2081 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:15] PROBLEM - puppet last run on aluminium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:15] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:16] PROBLEM - puppet last run on cp4028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:16] PROBLEM - puppet last run on db1121 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:17] PROBLEM - puppet last run on restbase1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:17] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:18] PROBLEM - puppet last run on restbase1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:18] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:19] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:20] PROBLEM - puppet last run on mw2240 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:20] PROBLEM - puppet last run on es2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:21] PROBLEM - puppet last run on mw2137 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:21] PROBLEM - puppet last run on snapshot1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:22] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:22] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:23] PROBLEM - puppet last run on analytics1077 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:23] PROBLEM - puppet last run on ores1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:24] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:24] PROBLEM - puppet last run on proton1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:25] PROBLEM - puppet last run on elastic2048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:27] PROBLEM - puppet last run on scb2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:29] PROBLEM - puppet last run on mw2162 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:31] PROBLEM - puppet last run on pc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:31] PROBLEM - puppet last run on actinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:31] PROBLEM - puppet last run on dbproxy1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:31] PROBLEM - puppet last run on lvs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:33] PROBLEM - puppet last run on ms-be1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:43] PROBLEM - puppet last run on mw2287 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:47] PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:49] PROBLEM - puppet last run on mw2141 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:49] PROBLEM - puppet last run on mw2156 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:49] PROBLEM - puppet last run on debmonitor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:57] PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:59] PROBLEM - puppet last run on mw2266 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:59] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:03:59] PROBLEM - puppet last run on rdb2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:01] PROBLEM - puppet last run on db1097 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:01] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:01] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:01] PROBLEM - puppet last run on elastic1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:01] PROBLEM - puppet last run on ms-be2032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:05] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:05] PROBLEM - puppet last run on labweb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:07] <_joe_> whoa
[09:04:13] PROBLEM - puppet last run on db1082 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:17] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:17] PROBLEM - puppet last run on db1070 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:17] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:17] PROBLEM - puppet last run on cloudvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:17] PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:17] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:19] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:19] PROBLEM - puppet last run on elastic1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:21] PROBLEM - puppet last run on lvs4006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:25] PROBLEM - puppet last run on acrux is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:27] PROBLEM - puppet last run on elastic1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:27] PROBLEM - puppet last run on mw2274 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:27] PROBLEM - puppet last run on es2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:28] (CR) Giuseppe Lavagetto: [C: +2] scap: fix my typos [puppet] - https://gerrit.wikimedia.org/r/493654 (owner: Giuseppe Lavagetto)
[09:04:29] it's fixed, but Icinga is a little slow to report :-)
[09:04:29] PROBLEM - puppet last run on deploy2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:31] PROBLEM - puppet last run on dns1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:31] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:31] PROBLEM - puppet last run on aqs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:31] PROBLEM - puppet last run on maps1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:34] <_joe_> moritzm: yeah I know
[09:04:37] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:37] PROBLEM - puppet last run on scb2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:37] PROBLEM - puppet last run on wtp2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:39] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:39] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:39] PROBLEM - puppet last run on mw2172 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:39] PROBLEM - puppet last run on mw2157 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:43] PROBLEM - puppet last run on pc2009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:43] PROBLEM - puppet last run on sessionstore2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:43] PROBLEM - puppet last run on ores2007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:43] PROBLEM - puppet last run on dbmonitor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:43] PROBLEM - puppet last run on mw1242 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:46] (PS2) Giuseppe Lavagetto: scap: fix my typos [puppet] - https://gerrit.wikimedia.org/r/493654
[09:04:50] <_joe_> grr
[09:04:51] PROBLEM - puppet last run on restbase1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:53] PROBLEM - puppet last run on mw1346 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:53] PROBLEM - puppet last run on restbase1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:04:53] PROBLEM - puppet last run on cloudvirt1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:01] PROBLEM - puppet last run on restbase2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:03] PROBLEM - puppet last run on elastic2053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:03] PROBLEM - puppet last run on mw1322 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:03] PROBLEM - puppet last run on an-worker1082 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:03] PROBLEM - puppet last run on ms-be2025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:03] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:04] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:05] <_joe_> where is icinga-wm running?
[09:05:07] PROBLEM - puppet last run on ms-be1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:07] PROBLEM - puppet last run on cp5012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:09] PROBLEM - puppet last run on db2045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:09] PROBLEM - puppet last run on mw2176 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:11] PROBLEM - puppet last run on db1118 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:15] PROBLEM - puppet last run on db1083 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:15] PROBLEM - puppet last run on an-worker1078 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:21] PROBLEM - puppet last run on db2035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:21] PROBLEM - puppet last run on mw2144 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:27] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:27] PROBLEM - puppet last run on druid1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:27] <_joe_> sorry lemme rephrase, where is the code for icinga-wm?
[09:05:37] PROBLEM - puppet last run on mc2019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:37] PROBLEM - puppet last run on cloudvirt1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:39] PROBLEM - puppet last run on lvs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:45] PROBLEM - puppet last run on mw2290 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:47] PROBLEM - puppet last run on dns1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:53] PROBLEM - puppet last run on dbproxy1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:55] PROBLEM - puppet last run on mw2280 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:55] PROBLEM - puppet last run on mw2233 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:55] PROBLEM - puppet last run on logstash1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:55] PROBLEM - puppet last run on eventlog1002 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [09:05:57] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:05:59] PROBLEM - puppet last run on cp1075 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:05:59] PROBLEM - puppet last run on db1086 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:01] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:09] PROBLEM - puppet last run on analytics1064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:09] PROBLEM - puppet last run on ores2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:09] PROBLEM - puppet last run on mw1345 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:11] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:11] PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:17] PROBLEM - puppet last run on ms-fe2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:17] PROBLEM - puppet last run on ores2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:17] PROBLEM - puppet last run on mw2256 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:19] PROBLEM - puppet last run on db2074 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:06:19] PROBLEM - puppet last run on mw2255 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:19] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:19] PROBLEM - puppet last run on mw2231 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:19] PROBLEM - puppet last run on mw2177 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:19] PROBLEM - puppet last run on mw2202 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:19] PROBLEM - puppet last run on mw2169 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:21] PROBLEM - puppet last run on ms-be2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:23] PROBLEM - puppet last run on proton2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:23] PROBLEM - puppet last run on mw2251 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:23] PROBLEM - puppet last run on kubestagetcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:25] PROBLEM - puppet last run on kafka1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:27] PROBLEM - puppet last run on mw1232 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:33] PROBLEM - puppet last run on mw2265 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:06:33] PROBLEM - puppet last run on mw2277 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:33] PROBLEM - puppet last run on ms-be2021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:39] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:39] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:39] PROBLEM - puppet last run on es1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:39] PROBLEM - puppet last run on ms-be2023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:39] PROBLEM - puppet last run on ms-be1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:45] PROBLEM - puppet last run on wtp1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:06:51] PROBLEM - puppet last run on db1106 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:05] PROBLEM - puppet last run on mw2225 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:07] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:07] PROBLEM - puppet last run on mw2220 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:41] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:07:41] PROBLEM - puppet last run on wtp1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:43] PROBLEM - puppet last run on cloudnet2002-dev is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:47] PROBLEM - puppet last run on cp5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:53] PROBLEM - puppet last run on wtp1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:55] PROBLEM - puppet last run on ms-fe1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:55] PROBLEM - puppet last run on maps1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:57] PROBLEM - puppet last run on mw2201 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:08:28] I've stopped ircecho, fixing puppet runs via cumin [09:14:27] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10Marostegui) dbstore1002 just crashed: ` Thread pointer: 0x0x0 Attempting backtrace. You can use the following information to find out where mysql...
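The cleanup step mentioned above ("fixing puppet runs via cumin") amounts to forcing a Puppet agent run across the affected fleet so the "puppet last run" CRITICALs clear without waiting for the next scheduled run. A minimal sketch, with the caveat that the `A:all` alias and the `run-puppet-agent` wrapper are assumptions about the local setup, not confirmed by this log; it is shown as a dry run since the real command needs a Cumin master:

```shell
# Sketch of a fleet-wide forced Puppet run via Cumin.
# ASSUMPTIONS: the 'A:all' host-selection alias and the 'run-puppet-agent'
# wrapper script are hypothetical stand-ins for the local environment.
# Echoed rather than executed so this is safe to run anywhere:
CUMIN_CMD="sudo cumin 'A:all' 'run-puppet-agent -q'"
echo "dry-run: ${CUMIN_CMD}"
```

In practice the host selection would be narrowed to the hosts that actually failed, to avoid hammering the puppetmasters with simultaneous catalog compiles.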
[09:14:47] RECOVERY - puppet last run on lvs4006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:14:54] elukey: ^ dbstore1002 knows the time is arriving [09:15:27] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [09:16:13] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:16:21] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:16:39] RECOVERY - puppet last run on mw2255 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [09:16:39] RECOVERY - puppet last run on mw2231 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [09:16:39] RECOVERY - puppet last run on mw2177 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [09:17:27] RECOVERY - puppet last run on mw2223 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:17:29] RECOVERY - puppet last run on mw2225 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [09:18:09] RECOVERY - puppet last run on cloudnet2002-dev is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:18:39] RECOVERY - puppet last run on puppetmaster2002 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [09:18:39] RECOVERY - puppet last run on wtp2005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:18:41] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:18:53] RECOVERY - puppet last run on ms-be2047 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:18:53] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures 
[09:18:53] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:18:55] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:18:55] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:18:55] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:19:03] RECOVERY - puppet last run on mw2162 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:19:03] RECOVERY - puppet last run on mc2023 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:19:23] RECOVERY - puppet last run on mw2253 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [09:19:23] RECOVERY - puppet last run on mw2141 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:19:33] RECOVERY - puppet last run on rdb2005 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [09:20:01] RECOVERY - puppet last run on mw2274 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:20:01] RECOVERY - puppet last run on acrux is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:20:03] RECOVERY - puppet last run on es2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:20:10] marostegui: poor dbstore1002 [09:20:11] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [09:20:11] RECOVERY - puppet last run on scb2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:20:11] RECOVERY - puppet last run on wtp2006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:20:13] RECOVERY - puppet last run on phab2001 is 
OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:20:13] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:20:13] RECOVERY - puppet last run on mw2172 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:20:15] RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:20:17] RECOVERY - puppet last run on sessionstore2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:20:33] RECOVERY - puppet last run on restbase2016 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:20:33] RECOVERY - puppet last run on elastic2053 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:20:37] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [09:20:41] RECOVERY - puppet last run on db2045 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:20:41] RECOVERY - puppet last run on mw2176 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:20:43] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:20:55] RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:21:03] RECOVERY - puppet last run on es2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:21:07] RECOVERY - puppet last run on mc2019 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [09:21:09] RECOVERY - puppet last run on elastic2039 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [09:21:15] RECOVERY - puppet last run on mw2290 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 
failures [09:21:25] RECOVERY - puppet last run on mw2280 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:21:25] RECOVERY - puppet last run on mw2233 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [09:21:41] RECOVERY - puppet last run on ores2006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:21:43] RECOVERY - puppet last run on mw2288 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:21:45] RECOVERY - puppet last run on mw2182 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [09:21:49] RECOVERY - puppet last run on ms-fe2005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:21:49] RECOVERY - puppet last run on mw2256 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:21:51] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:21:51] RECOVERY - puppet last run on db2074 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:21:51] RECOVERY - puppet last run on elastic2030 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [09:21:51] RECOVERY - puppet last run on mw2169 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [09:21:57] RECOVERY - puppet last run on proton2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:22:07] RECOVERY - puppet last run on mw2265 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:22:07] RECOVERY - puppet last run on mw2277 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:22:07] RECOVERY - puppet last run on ms-be2021 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [09:22:13] RECOVERY - puppet last run on ms-be2023 is OK: OK: 
Puppet is currently enabled, last run 2 seconds ago with 0 failures [09:22:37] RECOVERY - puppet last run on es2013 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:22:39] RECOVERY - puppet last run on db2042 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [09:22:43] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:23:13] PROBLEM - Host elastic2038 is DOWN: PING CRITICAL - Packet loss = 100% [09:23:29] huh [09:23:31] RECOVERY - puppet last run on mw2201 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [09:23:41] RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [09:23:51] RECOVERY - puppet last run on ms-fe2007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:23:53] RECOVERY - puppet last run on pc2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:24:01] RECOVERY - puppet last run on cp2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:24:01] RECOVERY - puppet last run on lvs2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:24:01] RECOVERY - puppet last run on db2081 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:24:01] RECOVERY - puppet last run on pybal-test2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:24:01] RECOVERY - puppet last run on wtp2010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:24:03] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:24:03] RECOVERY - puppet last run on es2017 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:24:03] RECOVERY - puppet 
last run on mw2240 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:24:04] RECOVERY - puppet last run on mw2137 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:24:13] RECOVERY - puppet last run on elastic2048 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:24:13] RECOVERY - puppet last run on scb2004 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [09:24:17] RECOVERY - puppet last run on mw2252 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:24:21] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:24:31] RECOVERY - puppet last run on mw2287 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:24:35] RECOVERY - puppet last run on mw2156 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:24:43] RECOVERY - puppet last run on mw2266 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [09:24:45] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:24:49] RECOVERY - puppet last run on ms-be2032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:25:17] RECOVERY - puppet last run on deploy2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:25:25] RECOVERY - puppet last run on mw2157 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:25:27] RECOVERY - puppet last run on pc2009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:25:27] RECOVERY - puppet last run on ores2007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:25:49] RECOVERY - puppet last run on ms-be2025 is OK: OK: Puppet is currently enabled, last run 3 minutes 
ago with 0 failures [09:26:07] RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:26:41] RECOVERY - puppet last run on cp1075 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [09:26:59] RECOVERY - puppet last run on elastic2035 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:27:01] RECOVERY - puppet last run on ores2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:27:03] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:27:03] RECOVERY - puppet last run on ms-be2017 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:27:07] RECOVERY - puppet last run on mw2251 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:27:19] RECOVERY - puppet last run on an-worker1079 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:27:31] RECOVERY - puppet last run on wtp1026 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [09:27:57] RECOVERY - puppet last run on mw2220 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [09:28:27] RECOVERY - puppet last run on db1094 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [09:28:29] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [09:28:33] RECOVERY - puppet last run on db1073 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [09:28:43] RECOVERY - puppet last run on ms-fe1006 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [09:28:43] RECOVERY - puppet last run on maps1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:28:47] RECOVERY - puppet last run on 
ms-be1018 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [09:28:49] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [09:29:13] RECOVERY - puppet last run on aluminium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:29:13] RECOVERY - puppet last run on db1121 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [09:29:13] RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [09:29:13] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:29:13] RECOVERY - puppet last run on restbase1010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:29:15] RECOVERY - puppet last run on pc1009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:29:17] RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [09:29:17] RECOVERY - puppet last run on snapshot1009 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:29:19] RECOVERY - puppet last run on analytics1077 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:29:19] RECOVERY - puppet last run on ores1008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:29:23] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [09:29:29] RECOVERY - puppet last run on actinium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:29:31] RECOVERY - puppet last run on lvs1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:29:37] RECOVERY - puppet last run on dbproxy1011 is OK: OK: Puppet is currently 
enabled, last run 3 minutes ago with 0 failures [09:29:47] RECOVERY - puppet last run on debmonitor1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:29:55] RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:29:59] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:29:59] RECOVERY - puppet last run on db1097 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [09:29:59] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:29:59] RECOVERY - puppet last run on elastic1025 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [09:30:09] RECOVERY - puppet last run on analytics1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:30:13] RECOVERY - puppet last run on db1082 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [09:30:15] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:30:15] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:30:15] RECOVERY - puppet last run on cloudvirt1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:30:15] RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:30:15] RECOVERY - puppet last run on db1070 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [09:30:17] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [09:30:19] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures 
[09:30:19] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:30:19] RECOVERY - puppet last run on ms-fe1008 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [09:30:19] RECOVERY - puppet last run on elastic1017 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [09:30:25] RECOVERY - puppet last run on elastic1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:30:29] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:30:31] RECOVERY - puppet last run on maps1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:30:31] RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:30:41] RECOVERY - puppet last run on dbmonitor1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [09:30:41] RECOVERY - puppet last run on mw1242 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:30:47] RECOVERY - puppet last run on restbase1017 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [09:30:51] RECOVERY - puppet last run on restbase1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:30:51] RECOVERY - puppet last run on mw1346 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [09:31:01] RECOVERY - puppet last run on an-worker1082 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:31:01] RECOVERY - puppet last run on mw1322 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:31:03] RECOVERY - puppet last run on ms-be1032 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:31:07] RECOVERY - puppet last run on db1118 
is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:31:11] RECOVERY - puppet last run on an-worker1078 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:31:11] RECOVERY - puppet last run on db1083 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [09:31:19] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:31:23] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:31:23] RECOVERY - puppet last run on druid1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:31:33] RECOVERY - puppet last run on cloudvirt1028 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:31:33] RECOVERY - puppet last run on lvs1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:31:45] RECOVERY - puppet last run on db1117 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:31:47] RECOVERY - puppet last run on dbproxy1004 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [09:31:49] RECOVERY - puppet last run on logstash1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:31:49] RECOVERY - puppet last run on eventlog1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:31:51] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [09:31:53] RECOVERY - puppet last run on db1086 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [09:32:01] RECOVERY - puppet last run on analytics1064 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:32:03] RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently 
enabled, last run 1 minute ago with 0 failures [09:32:03] RECOVERY - puppet last run on mw1345 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:32:05] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [09:32:09] RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:32:17] RECOVERY - puppet last run on kubestagetcd1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:32:19] RECOVERY - puppet last run on kafka1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:32:23] RECOVERY - puppet last run on mw1232 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:32:33] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [09:32:33] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [09:32:33] RECOVERY - puppet last run on es1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:32:33] RECOVERY - puppet last run on ms-be1034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:32:47] RECOVERY - puppet last run on db1106 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:33:43] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:33:43] RECOVERY - puppet last run on wtp1040 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:33:51] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:33:53] RECOVERY - puppet last run on wtp1046 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures 
[09:33:53] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:33:57] RECOVERY - puppet last run on ms-be1043 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:34:05] RECOVERY - puppet last run on mc1028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:34:05] RECOVERY - puppet last run on mw1309 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:34:27] RECOVERY - puppet last run on restbase1011 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:34:29] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:34:41] RECOVERY - puppet last run on pc1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:34:41] RECOVERY - puppet last run on dbproxy1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:34:45] RECOVERY - puppet last run on ms-be1028 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:34:49] RECOVERY - puppet last run on mw1287 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:35:17] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:35:17] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:35:17] RECOVERY - puppet last run on labweb1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:36:03] RECOVERY - puppet last run on cloudvirt1019 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:44:19] 10Operations, 10ops-codfw: elastic2038 CPU/memory errors - https://phabricator.wikimedia.org/T217398 (10MoritzMuehlenhoff) [09:45:46] 10Operations, 10ops-codfw: elastic2038 
CPU/memory errors - https://phabricator.wikimedia.org/T217398 (10Mathew.onipe) p:05Triage→03High [09:54:26] (03PS1) 10Muehlenhoff: Create /etc/debdeploy-autorestarts.conf which lists all automated restarts [puppet] - 10https://gerrit.wikimedia.org/r/493659 [09:58:18] (03PS1) 10Ema: trafficserver (8.0.2-1wm1) stretch-wikimedia; urgency=medium [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/493660 [10:01:26] (03CR) 10jerkins-bot: [V: 04-1] trafficserver (8.0.2-1wm1) stretch-wikimedia; urgency=medium [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/493660 (owner: 10Ema) [10:22:08] (03CR) 10DCausse: [C: 03+1] cloudelastic: Add cloudelastic configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [10:23:45] (03PS2) 10Ema: trafficserver (8.0.2-1wm1) stretch-wikimedia; urgency=medium [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/493660 [10:23:51] PROBLEM - puppet last run on prometheus2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 seconds ago with 1 failures. 
Failed resources (up to 3 shown): File[/srv/prometheus/k8s/prometheus.yml] [10:26:20] that's me ^ [10:26:52] (03CR) 10jerkins-bot: [V: 04-1] trafficserver (8.0.2-1wm1) stretch-wikimedia; urgency=medium [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/493660 (owner: 10Ema) [10:27:32] that's debian-glue timing out after 180s ^ [10:29:03] RECOVERY - puppet last run on prometheus2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:30:49] which is strange because I did set BUILD_TIMEOUT to one hour: https://github.com/wikimedia/integration-config/blob/master/zuul/parameter_functions.py#L129 [10:42:47] (03PS15) 10Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) [10:43:00] (03CR) 10Mathew.onipe: cloudelastic: Add cloudelastic configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [10:43:20] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 10Release-Engineering-Team (Kanban): zuul seemingly ignoring BUILD_TIMEOUT - https://phabricator.wikimedia.org/T217403 (10ema) [10:43:31] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 10Release-Engineering-Team (Kanban): zuul seemingly ignoring BUILD_TIMEOUT - https://phabricator.wikimedia.org/T217403 (10ema) p:05Triage→03Normal [10:47:24] (03CR) 10DCausse: [C: 03+1] cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [10:59:28] (03PS1) 10Jcrespo: mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) [11:00:07] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - 
10https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [11:03:25] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493652 (owner: 10Marostegui) [11:04:36] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493652 (owner: 10Marostegui) [11:04:48] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493652 (owner: 10Marostegui) [11:05:43] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1094 (duration: 00m 50s) [11:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:06] RECOVERY - ensure kvm processes are running on labvirt1008 is OK: PROCS OK: 1 process with regex args /usr/bin/kvm [11:17:32] !log rebooting labstore2004.codfw.wmnet [11:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:06] (03PS1) 10Elukey: hadoop: move ssl configs rendering out of hadoop.pp [puppet/cdh] - 10https://gerrit.wikimedia.org/r/493668 [11:20:40] (03Abandoned) 10Elukey: hadoop: move ssl configs rendering out of hadoop.pp [puppet/cdh] - 10https://gerrit.wikimedia.org/r/493668 (owner: 10Elukey) [11:20:53] (03PS1) 10Alexandros Kosiaris: Add citoid specific statsd mappings [deployment-charts] - 10https://gerrit.wikimedia.org/r/493669 (https://phabricator.wikimedia.org/T213194) [11:20:57] (03PS1) 10Alexandros Kosiaris: Publish citoid 0.0.2 version [deployment-charts] - 10https://gerrit.wikimedia.org/r/493670 (https://phabricator.wikimedia.org/T213194) [11:41:39] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/493659 (owner: 10Muehlenhoff) [11:42:22] 10Operations, 10monitoring, 10Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) To test 
the new plan above I've started an rsync + migration of all instances of prometheus2003, starting from a snapshot of data from pr... [11:47:46] !log rebooting labsdb1005.codfw.wmnet [11:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:16] (03PS1) 10KartikMistry: WIP: Enable ExternalGuidance to all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493672 (https://phabricator.wikimedia.org/T216129) [11:56:07] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [11:57:17] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [11:58:31] PROBLEM - mysqld processes on labsdb1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [11:58:57] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [11:58:58] expected looks like, jbond42 [11:59:09] investigating now godog [11:59:13] got the page... [11:59:19] labsdb1005 ? 
[11:59:22] uh huh [11:59:35] (03Restored) 10Elukey: hadoop: move ssl configs rendering out of hadoop.pp [puppet/cdh] - 10https://gerrit.wikimedia.org/r/493668 (owner: 10Elukey) [11:59:47] RECOVERY - mysqld processes on labsdb1005 is OK: PROCS OK: 1 process with command name mysqld [11:59:51] it was just rebooted for https://phabricator.wikimedia.org/T216802 [11:59:56] seems mysql didn't start [12:00:02] yeah that would have been it, if it wasn't downtimed [12:00:39] it was downtimed in icinga but mysql didn't start when it came back up [12:00:44] hm [12:01:17] were all services on the host downtimed as well? [12:01:30] (trying to figure out why it paged) [12:01:55] (03CR) 10Elukey: [C: 03+2] hadoop: move ssl configs rendering out of hadoop.pp [puppet/cdh] - 10https://gerrit.wikimedia.org/r/493668 (owner: 10Elukey) [12:02:00] the downtime had finished because the host came back up. the alert was valid. when the box came back up mysql and mariadb were not started. I had to start them manually [12:03:00] it seems jynus manually killed mysql on this server on 2019-02-18 according to the SAL [12:04:48] was it meant to remain not running?
[12:04:59] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Pr [12:04:59] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200) [12:05:20] (03CR) 10Mathew.onipe: "PCC is happy: https://puppet-compiler.wmflabs.org/compiler1002/14938/" [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:05:27] not sure apergos, jynus should know more (or brooke, but it's the middle of the night for her) [12:05:33] ok [12:05:54] it was not running at the time of the reboot? [12:06:28] mysql doesn't start automatically on reboot [12:06:34] that is a feature, not a bug [12:06:36] ah there's the answer, thank you [12:06:40] :-) [12:06:53] if you don't like it, you can configure the class to do so [12:06:53] thanks jynus [12:06:54] so the next question is about avoiding pages on reboot, for that service [12:07:06] but a) don't set it as default b) I don't recommend it [12:07:25] but you are free to do so, I think it is the autostart=1 parameter [12:07:35] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a r [12:07:35] ved [12:08:32] $ensure = stopped is the parameter, on the mariadb::service [12:08:52] + managed =
true [12:09:17] manage, not managed [12:09:51] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Pr [12:09:51] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200) [12:13:31] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy [12:16:54] (03CR) 10Muehlenhoff: Add ability to filter out auto restarts (031 comment) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/493463 (owner: 10Jbond) [12:18:33] PROBLEM - configured eth on proton1002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.61: Connection reset by peer [12:19:39] RECOVERY - configured eth on proton1002 is OK: OK - interfaces up [12:19:43] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobi [12:19:43] ed the unexpected status 500 (expecting: 200) [12:22:17] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:22:52] (03PS1) 10Jbond: Remove unused libraries and use collections.defaultdict [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/493675 [12:22:54] * akosiaris looking into proton [12:23:45] ps auxww |grep chromium |wc -l [12:23:45] 32 
[12:23:47] gulp [12:24:05] either someone decided to pdfize a ton of articles or there's a bug [12:24:33] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [12:24:42] proton1001 is even better... 99 chromium processes [12:24:43] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:25:11] ah great... stuck from Jan 30 [12:25:13] perfect [12:26:55] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy [12:31:44] !log restart proton on proton1001, counted 99 chromium processes left running since at least Jan 30 [12:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:50] !log restart proton1002, OOM showed up [12:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:24] https://grafana.wikimedia.org/d/000000563/proton?orgId=1 [12:35:33] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:35:42] hmm indeed someone is creating a lot of pdfs [12:36:45] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:37:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[debs/debdeploy] - 10https://gerrit.wikimedia.org/r/493675 (owner: 10Jbond) [12:38:01] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 10Release-Engineering-Team (Kanban): zuul seemingly ignoring BUILD_TIMEOUT - https://phabricator.wikimedia.org/T217403 (10hashar) The job has a BUILD_TIMEOUT parameter that defaults to 30 (minutes). That is configured in the job itself (as w... [12:38:47] seems like someone is trying to pdfize large parts of de.wikipedia.org [12:40:15] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - proton_24766: Servers proton1002.eqiad.wmnet are marked down but pooled [12:40:27] PROBLEM - LVS HTTP IPv4 on proton.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:40:44] uh [12:41:03] yeah big incoming traffic for pdfs [12:41:11] proton seems to not be able to keep up with the rate [12:41:35] RECOVERY - LVS HTTP IPv4 on proton.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 951 bytes in 0.080 second response time [12:41:35] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy [12:42:07] any way to throttle them? [12:42:13] that's what I am searching [12:42:24] supposedly proton has some queue but ... 
[12:42:38] it does return queue is full but still [12:44:15] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 10Release-Engineering-Team (Kanban): zuul seemingly ignoring BUILD_TIMEOUT - https://phabricator.wikimedia.org/T217403 (10hashar) I have refreshed the jenkins job just in case but it still comes hardcoded to 3 minutes apparently ;-( [12:44:31] I 'll silence the paging alert just to avoid more pages for the next couple of hours [12:44:38] I 'll leave the rest of the alerts as is however [12:45:00] if it can't keep up with the queue once it's full, maybe the queue needs to be shorter (so more requests are rejected) [12:45:57] the config says it's 3 [12:46:01] whatever that 3 means [12:46:32] render_concurrency: 3 [12:46:32] render_queue_timeout: 60 [12:46:33] render_execution_timeout: 90 [12:46:33] max_render_queue_size: 50 [12:46:33] ugh [12:46:45] queue_size? maybe that? [12:46:49] but who knows [12:47:08] I can lower it, but supposedly with a render_concurrency of 3 we should not be having this problem [12:47:15] unless those things are really badly named [12:47:21] anyway, sure, I 'll drop it to 20 [12:49:58] !log lower max_render_queue_size: to 20 for proton on proton100{1,2} [12:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:28] <_joe_> maybe we can rate-limit per ip at the varnish layer? [12:50:31] <_joe_> ema: ^^ [12:50:36] we already do IIRC [12:50:45] <_joe_> but specifically for pdfs [12:50:49] <_joe_> to like 1 per second [12:50:54] although it's almost certainly more than what this endpoint can survive [12:50:56] yeah sure [12:51:21] <_joe_> [12:52:34] The maximum number of simultaneous requests the server can render successfully is `max_render_queue_size + render_concurrency`.
(from the docs) [12:52:42] guess we'll see what happens [12:53:49] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Pr [12:53:49] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200) [12:57:27] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy [12:57:39] <_joe_> akosiaris: want me to look into rate-limiting in varnish? [12:58:02] _joe_: we can do that, we'll have to publish the IP though? [12:58:26] <_joe_> ema: we want to limit the pdf creation urls I guess, not a specific ip [12:58:36] I think I can get them blocked at the router level as well [12:58:49] <_joe_> if we want to just ban one ip, sure [12:58:59] _joe_: smart! :) [12:58:59] <_joe_> if we have a specific UA, even better [12:59:24] <_joe_> ema: my idea was rate-limit to something like 1 request/IP/second [12:59:56] <_joe_> but now that I think about it, what would we block? the url is localized IIRC [13:00:00] <_joe_> :/ [13:00:48] this /api/rest_v1/page/pdf/Esther_Sunday [13:00:50] <_joe_> Special:Book I mean [13:00:59] no, it's over the restbase API [13:01:13] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [13:01:32] <_joe_> oh ok [13:01:40] <_joe_> rb has its own ratelimiting IIRC? 
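The capacity arithmetic from those docs can be modelled directly. A minimal Python sketch (the real service is a Node.js app; the names below just mirror the config keys being discussed, and the mechanics are an assumption for illustration):

```python
import queue

# Illustrative model of the proton settings discussed above; names
# mirror the config keys, everything else is an assumption.
RENDER_CONCURRENCY = 3
MAX_RENDER_QUEUE_SIZE = 50

render_queue = queue.Queue(maxsize=MAX_RENDER_QUEUE_SIZE)
in_flight = 0  # renders currently executing

def accept(job):
    """Accept a render job, or reject it once queue + workers are full."""
    global in_flight
    if in_flight < RENDER_CONCURRENCY:
        in_flight += 1  # starts rendering immediately
        return "rendering"
    try:
        render_queue.put_nowait(job)
        return "queued"
    except queue.Full:
        return "rejected"  # the "queue is full" responses seen above

# At most max_render_queue_size + render_concurrency = 53 requests are
# held at once; everything beyond that is rejected.
results = [accept(n) for n in range(60)]
```

With the values in the paste, 3 requests render, 50 queue, and the rest get the "queue is full" rejection, which matches the docs' 53-request ceiling.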
[13:01:53] so now we are under some minor control [13:02:05] yes rb does have its own rate limiter [13:02:05] I 've lowered both settings enough to allow the box to continue existing [13:02:16] <_joe_> akosiaris: ok [13:02:20] if (vsthrottle.is_denied("rest:" + req.http.X-Client-IP, 1000, 10s)) [13:02:23] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy [13:02:36] <_joe_> ema: no I meant inside the software [13:02:50] <_joe_> rb can do concurrency limits [13:02:57] <_joe_> to backends [13:04:50] ok if my back of the envelope calculations are correct there is an austrian IP having done some 36k requests to pdf restbase api since 06:27 this morning with probably the bulk of it happening since 11:00 [13:04:53] all times UTC [13:05:09] ouch! [13:05:31] (03CR) 10Mathew.onipe: "some comments" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) (owner: 10DCausse) [13:05:42] blocking them directly (a 404 with body that explains why) would be nice if it were doable [13:05:48] *403 [13:05:58] which is about 5 req/s so well without the global rate limits [13:06:14] within* [13:06:43] per https://grafana.wikimedia.org/d/000000563/proton?orgId=1&from=now-3h&to=now they 've up to at a max of 20 [13:06:43] <_joe_> A 429 maybe gentler [13:07:11] sure [13:07:17] now they are averaging on the 5req/s [13:07:33] 10Operations, 10ExternalGuidance, 10Traffic, 10MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10Pginer-WMF) >>! In T212197#4991170, @dr0ptp4kt wrote: > What's our latest read on the releas... [13:08:07] I guess it's a ua that looks like a browser? [13:08:28] good q, looking [13:08:31] no contact info inside the string or anything like that (script best practices as we recommend) [13:08:32] ? 
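As a sanity check on that estimate (assuming, as described above, that the bulk of the ~36k requests fell in the window from 11:00 to roughly 13:05 UTC):

```python
# Back-of-the-envelope check of the ~5 req/s figure above, assuming the
# bulk of the ~36k requests landed between 11:00 and ~13:05 UTC.
requests = 36_000
window_seconds = (13 * 3600 + 5 * 60) - (11 * 3600)  # 11:00 -> 13:05
rate = requests / window_seconds  # ~4.8 req/s
```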
[13:08:50] (03PS1) 10Ema: varnish: rate limit proton [puppet] - 10https://gerrit.wikimedia.org/r/493683 [13:08:53] (03CR) 10Muehlenhoff: [C: 03+2] Create /etc/debdeploy-autorestarts.conf which lists all automated restarts [puppet] - 10https://gerrit.wikimedia.org/r/493659 (owner: 10Muehlenhoff) [13:09:19] "user_agent":"-" [13:09:21] ok it's a bot [13:09:25] (03PS2) 10Ema: varnish: rate limit proton [puppet] - 10https://gerrit.wikimedia.org/r/493683 [13:09:29] bockblockblock [13:09:52] "uri_path":"/api/rest_v1/page/pdf/Liste_der_h\u00f6chsten_Bauwerke_in_Sierra_Leone","uri_query":"","content_type":"application/problem+json","referer":"-","user_agent":"-","accept_language":"-" [13:09:58] yeah definitely a bot [13:10:36] if we're interrupting work by some community member or a researcher, they can find an email or irc channel info and come ask [13:11:04] (03PS3) 10Ema: varnish: rate limit proton [puppet] - 10https://gerrit.wikimedia.org/r/493683 [13:12:00] (03CR) 10Alexandros Kosiaris: [C: 03+1] varnish: rate limit proton [puppet] - 10https://gerrit.wikimedia.org/r/493683 (owner: 10Ema) [13:12:42] ema, that's 10 req in 10 seconds? [13:12:56] yeah so 1req/s [13:12:57] apergos: yes [13:13:04] 👍 [13:13:37] right, with 10 burst [13:13:47] hm [13:14:00] how is burst calculated? [13:15:27] the idea as I understand it is that you can perform 10 req in 10s, so it's fine to perform 10 requests in 1s, but then for the remaining 9s you're rate limited [13:15:51] (03PS2) 10Jbond: Add ability to filter out auto restarts [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/493463 [13:17:03] good if it works [13:17:20] akosiaris: ok to merge or do you want to change anything?
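The burst behaviour described above matches a token bucket. A rough Python model of how vmod_vsthrottle's is_denied(key, limit, period) seems to behave (continuous refill at limit/period tokens per second is an assumption here, not taken from the vmod's source):

```python
class Bucket:
    """Token bucket: `limit` requests per `period` seconds, continuous refill."""

    def __init__(self, limit, period):
        self.capacity = float(limit)
        self.rate = limit / period  # tokens regained per second
        self.tokens = self.capacity
        self.last = 0.0

    def is_denied(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return False
        return True

b = Bucket(limit=10, period=10)
burst = [b.is_denied(0.0) for _ in range(10)]  # a burst of 10 at t=0 passes
eleventh = b.is_denied(0.0)                    # the 11th is denied
after_1s = b.is_denied(1.0)                    # one slot reopens per second
```

This reproduces the "10 requests in 1s, then rate limited for the remaining 9s" behaviour explained above: the full burst drains the bucket, after which requests are admitted at 1/s.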
[13:17:23] I mean I don't love a burst of 10, if that could be capped too it would be better [13:17:54] ema: let's see how it works [13:18:07] cause it sounds ok in principle [13:18:19] (03CR) 10Ema: [C: 03+2] varnish: rate limit proton [puppet] - 10https://gerrit.wikimedia.org/r/493683 (owner: 10Ema) [13:18:58] https://grafana.wikimedia.org/d/000000563/proton?orgId=1&from=now-30m&to=now [13:19:04] hmm maybe they 've paused [13:20:12] (03CR) 10DCausse: [WIP] Add support for elasticsearch 6 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) (owner: 10DCausse) [13:20:32] that would be fine too [13:22:29] ema: let me know when the change has fully propagated so I can revert the configs to their original settings on the proton side [13:24:04] akosiaris: either the usual 30m or I can cumin a puppet agent run if you wish [13:24:30] !log removed sca* hosts from debmonitor database [13:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:38] akosiaris: in that case, is esams enough or was the IP non-EU? [13:24:51] esams is enough I 'd say [13:26:11] in any case things have returned to normal [13:26:43] akosiaris: maybe those settings shouldn't be reverted?
lol [13:27:14] could be, but it should be after a discussion with the team owning it [13:27:41] if the settings are such that proton can now keep up with its queue when there are a larger number of incoming requests, maybe that's what we want [13:28:08] sure, but we don't know if it was the settings change or the bot just quit [13:28:20] hm you have a point [13:28:45] guess this needs to be a task, meh [13:28:47] akosiaris: change fully applied to text_esams [13:28:53] ema: cool, thanks [13:29:03] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 10Release-Engineering-Team (Kanban): zuul seemingly ignoring BUILD_TIMEOUT - https://phabricator.wikimedia.org/T217403 (10hashar) I created a job from scratch in jjb: ` - job: name: build-timeout-jjb node: contint1001 parameters:... [13:29:16] I 'll monitor this for the next hour or so [13:29:24] I haven't yet reverted the settings anyway [13:30:45] apergos: yeah it needs to be a task. I 'll file one, but after lunch [13:30:57] (03PS1) 10Mforns: Add timer to delete analytics EL unsanitized events after 90d [puppet] - 10https://gerrit.wikimedia.org/r/493687 (https://phabricator.wikimedia.org/T209503) [13:31:01] enjoy! (your lunch) [13:39:35] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 10Release-Engineering-Team (Kanban), 10Upstream: zuul seemingly ignoring BUILD_TIMEOUT - https://phabricator.wikimedia.org/T217403 (10hashar) The root cause is JJB tries to fetch information for a plugin named `Jenkins build timeout plugi... [13:40:57] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 10Release-Engineering-Team (Kanban), 10Upstream: Jenkins job builder ignores BUILD_TIMEOUT - https://phabricator.wikimedia.org/T217403 (10hashar) [14:00:58] akosiaris: I do see a bunch of requests being rate-limited now [14:02:18] (03CR) 10Herron: "Thanks!
Normally I'd agree on waiting until Monday, but in this case the patch will persist the config change that was made to address the" [puppet] - 10https://gerrit.wikimedia.org/r/493610 (https://phabricator.wikimedia.org/T200960) (owner: 10Herron) [14:10:20] (03CR) 10Mathew.onipe: [WIP] Add support for elasticsearch 6 (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) (owner: 10DCausse) [14:15:48] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 10Release-Engineering-Team (Kanban), 10Upstream: Jenkins job builder ignores BUILD_TIMEOUT - https://phabricator.wikimedia.org/T217403 (10hashar) [14:22:14] ema: ah nice! [14:22:16] thanks! [14:22:31] I do see a minor bump in requests but nothing alarming [14:26:12] (03CR) 10Mathew.onipe: [WIP] Add support for elasticsearch 6 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) (owner: 10DCausse) [14:26:47] (03CR) 10CDanis: Add ganeti->netbox sync script (032 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [14:33:39] !log Updating all debian-glue Jenkins job to properly take in account the BUILD_TIMEOUT parameter # T217403 [14:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:43] T217403: Jenkins job builder ignores BUILD_TIMEOUT - https://phabricator.wikimedia.org/T217403 [14:36:38] (03PS1) 10Elukey: hadoop: allow the configuration of ssl-(server|client).xml configs [puppet] - 10https://gerrit.wikimedia.org/r/493693 [14:38:12] 10Operations, 10Traffic: Indexing of https://www.wikidata.org in the Yandex Search Engine - https://phabricator.wikimedia.org/T217407 (10Anomie) This has nothing to do with the API itself, it's a question about use of the API. So I'm going to remove #mediawiki-api. It may be that the best way to index wikidata... 
[14:38:39] (03CR) 10Hashar: "recheck" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/493660 (owner: 10Ema) [14:39:00] hashar: thanks for working on the timeout thing :) [14:39:12] I am surprised we haven't had the issue before :/ [14:39:18] RECOVERY - nova-compute proc maximum on cloudvirt1024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [14:39:46] does the majority of our software build in < 3 minutes? [14:44:21] seeing that bug.. it looks like it ;P [14:44:41] (03CR) 10Mathew.onipe: Add cookbook for elastic6 upgrade (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [14:48:27] ema: https://integration.wikimedia.org/ci/job/debian-glue/1446/ it works :) [14:48:57] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 10Release-Engineering-Team (Kanban), 10Upstream: Jenkins job builder ignores BUILD_TIMEOUT - https://phabricator.wikimedia.org/T217403 (10hashar) 05Open→03Resolved It works ! :) [14:51:55] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/493610 (https://phabricator.wikimedia.org/T200960) (owner: 10Herron) [14:52:03] (03PS1) 10Muehlenhoff: Allow filtering services for restart notification (WIP) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/493697 [14:54:06] hashar_: \o/ [14:58:55] (03CR) 10Muehlenhoff: Add ability to filter out auto restarts (033 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/493463 (owner: 10Jbond) [15:02:25] !log restore proton config values [15:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:47] (03CR) 10DCausse: Add cookbook for elastic6 upgrade (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [15:14:05] (03CR) 10Jbond: [V: 03+2 C: 03+1] Remove unused libraries and use collections.defaultdict [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/493675 (owner: 10Jbond) [15:14:12] (03CR)
10Jbond: [V: 03+2 C: 03+2] Remove unused libraries and use collections.defaultdict [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/493675 (owner: 10Jbond) [15:14:27] (03PS2) 10Elukey: hadoop: allow the configuration of ssl-(server|client).xml configs [puppet] - 10https://gerrit.wikimedia.org/r/493693 (https://phabricator.wikimedia.org/T217412) [15:17:35] (03PS3) 10Jbond: Add ability to filter out auto restarts [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/493463 [15:17:47] (03PS3) 10Elukey: hadoop: allow the configuration of ssl-(server|client).xml configs [puppet] - 10https://gerrit.wikimedia.org/r/493693 (https://phabricator.wikimedia.org/T217412) [15:18:56] (03PS2) 10Herron: logstash: disable persisted queue [puppet] - 10https://gerrit.wikimedia.org/r/493610 (https://phabricator.wikimedia.org/T200960) [15:20:29] (03PS4) 10Elukey: hadoop: allow the configuration of ssl-(server|client).xml configs [puppet] - 10https://gerrit.wikimedia.org/r/493693 (https://phabricator.wikimedia.org/T217412) [15:20:48] (03CR) 10Herron: [C: 03+2] logstash: disable persisted queue [puppet] - 10https://gerrit.wikimedia.org/r/493610 (https://phabricator.wikimedia.org/T200960) (owner: 10Herron) [15:21:45] (03PS4) 10Jbond: Add ability to filter out auto restarts [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/493463 [15:21:55] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Logstash packet loss - https://phabricator.wikimedia.org/T200960 (10CDanis) FTR: Twice in two months we've seen all the logstashen in one cluster 'lock up' at around the same time: stop processing incoming events, huge backlog of socket recv-Q bytes, J... 
[15:22:24] (PS5) Elukey: hadoop: allow the configuration of ssl-(server|client).xml configs [puppet] - https://gerrit.wikimedia.org/r/493693 (https://phabricator.wikimedia.org/T217412)
[15:23:57] (CR) Ema: [C: +2] trafficserver (8.0.2-1wm1) stretch-wikimedia; urgency=medium [debs/trafficserver] - https://gerrit.wikimedia.org/r/493660 (owner: Ema)
[15:32:57] (CR) Giuseppe Lavagetto: [C: -1] "> Patch Set 4:" [puppet] - https://gerrit.wikimedia.org/r/492948 (owner: Aaron Schulz)
[15:33:46] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/14945/ seems a no-op for the current code, need to add secrets to labs_private and see ho" [puppet] - https://gerrit.wikimedia.org/r/493693 (https://phabricator.wikimedia.org/T217412) (owner: Elukey)
[15:38:48] !log trafficserver_8.0.2-1wm1 uploaded to stretch-wikimedia
[15:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:36] (CR) Paladox: [V: +2 C: +2] "Verified locally that this builds." [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/493311 (owner: Paladox)
[15:56:05] (CR) Paladox: [V: +2 C: +2] "This builds upstream so is verified." [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/493412 (owner: Paladox)
[15:58:50] (CR) Paladox: [C: +1] "Im seeing" [puppet] - https://gerrit.wikimedia.org/r/493317 (https://phabricator.wikimedia.org/T217287) (owner: Thcipriani)
[16:06:22] (PS4) Esanders: VE: Enable true section editing for mobile on labswiki & testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/493482 (https://phabricator.wikimedia.org/T217365)
[16:15:15] (PS2) Jcrespo: mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203)
[16:15:54] (CR) jerkins-bot: [V: -1] mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[16:27:47] (CR) Muehlenhoff: [C: +1] sudo: use validate_cmd [puppet] - https://gerrit.wikimedia.org/r/492718 (owner: Giuseppe Lavagetto)
[16:33:49] !log rolling security update of bind9 packages on jessie and trusty
[16:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:56] (PS3) Jcrespo: mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203)
[16:38:06] (CR) jerkins-bot: [V: -1] mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[16:39:53] PROBLEM - DPKG on restbase1011 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[16:42:17] RECOVERY - DPKG on restbase1011 is OK: All packages OK
[16:45:00] Operations, Traffic: Indexing of https://www.wikidata.org in the Yandex Search Engine - https://phabricator.wikimedia.org/T217407 (Reedy) >Downloading the wikidata dumps might not help in this situation as we need to crawl pages a user sees them. Noting they're wanting to crawl the user facing pages (wh...
[16:57:24] (PS1) Bstorm: wikireplicas: correct join for logging_compat [puppet] - https://gerrit.wikimedia.org/r/493718 (https://phabricator.wikimedia.org/T212972)
[16:59:58] (CR) Bstorm: [C: +2] wikireplicas: correct join for logging_compat [puppet] - https://gerrit.wikimedia.org/r/493718 (https://phabricator.wikimedia.org/T212972) (owner: Bstorm)
[17:01:51] (PS5) Jbond: Add ability to filter out auto restarts [debs/debdeploy] - https://gerrit.wikimedia.org/r/493463
[17:02:04] (CR) Alexandros Kosiaris: [C: +1] "Fine by me then" [deployment-charts] - https://gerrit.wikimedia.org/r/493444 (https://phabricator.wikimedia.org/T206785) (owner: Ottomata)
[17:06:44] (PS1) Jbond: Remove if statement as we now use defaultdict [debs/debdeploy] - https://gerrit.wikimedia.org/r/493720
[17:07:05] (CR) Bartosz Dziewoński: [C: -1] "I think ‘labswiki’ is wikitech.wikimedia.org, not Beta Cluster wikis. Not sure if this is what you want?" [mediawiki-config] - https://gerrit.wikimedia.org/r/493482 (https://phabricator.wikimedia.org/T217365) (owner: Esanders)
[17:08:04] (CR) Reedy: "Yeah, labswiki == wikitech" [mediawiki-config] - https://gerrit.wikimedia.org/r/493482 (https://phabricator.wikimedia.org/T217365) (owner: Esanders)
[17:10:47] (CR) Jbond: Add ability to filter out auto restarts (4 comments) [debs/debdeploy] - https://gerrit.wikimedia.org/r/493463 (owner: Jbond)
[17:23:03] (CR) CRusnov: "Thank you for the review!" (2 comments) [software/netbox-deploy] - https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229) (owner: CRusnov)
[17:25:37] (PS1) Marostegui: db-eqiad.php: Update pc1007 rack [mediawiki-config] - https://gerrit.wikimedia.org/r/493722
[17:31:09] Operations, ops-eqiad: Update pc1007,pc1010 status on netbox - https://phabricator.wikimedia.org/T217429 (Marostegui)
[18:21:32] Operations, ExternalGuidance, Traffic, MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (dr0ptp4kt) Thanks @Pginer-WMF. I've put a HOLD on the calendar for March 6 to get the Varnis...
[18:32:51] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[18:33:01] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[18:33:45] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[18:33:45] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[18:33:59] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[18:34:03] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[18:35:25] PROBLEM - Check the NTP synchronisation status of timesyncd on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[18:37:09] PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[18:45:43] PROBLEM - SSH on notebook1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:45:52] hrmmm
[18:45:55] was that planned?
[18:46:47] RECOVERY - SSH on notebook1003 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0)
[18:48:08] =/
[18:53:41] !log notebook1003 has unusually high load recently (23) and seemed to lag in reporting to icinga. no hardware failures, pinged about it in #wikimedia-analytics
[18:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:41] PROBLEM - IPMI Sensor Status on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[19:12:54] Operations, SRE-Access-Requests: Requesting access to stat1007 for sukhe - https://phabricator.wikimedia.org/T217438 (ssingh)
[19:14:20] Operations, SRE-Access-Requests: Requesting access to stat1007 for sukhe - https://phabricator.wikimedia.org/T217438 (ssingh)
[19:16:03] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up
[19:16:05] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[19:16:07] RECOVERY - DPKG on notebook1003 is OK: All packages OK
[19:16:17] RECOVERY - Disk space on notebook1003 is OK: DISK OK
[19:16:59] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient
[19:16:59] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational
[19:17:23] !log pre-configure asw-a5 ports on asw2-a5-eqiad - T187960
[19:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:27] T187960: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960
[19:18:41] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:29:01] !log pre-configure asw-a6 ports on asw2-a6-eqiad - T187960
[19:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:04] T187960: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960
[19:31:01] (PS10) CRusnov: Add ganeti->netbox sync script [software/netbox-deploy] - https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229)
[19:31:05] (CR) Aaron Schulz: "If an actual error occurred on the DC-local mc server, then the "worst" reply would be an error code rather than NOT_STORED ( https://gith" [puppet] - https://gerrit.wikimedia.org/r/492948 (owner: Aaron Schulz)
[19:31:17] (PS4) Jcrespo: mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203)
[19:31:25] (CR) jerkins-bot: [V: -1] mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[19:31:30] (CR) Jcrespo: "Still needs more work." [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[19:31:53] RECOVERY - IPMI Sensor Status on notebook1003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:32:22] (PS5) Jcrespo: mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203)
[19:32:24] (CR) jerkins-bot: [V: -1] mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[19:32:37] !log pre-configure asw-a7 ports on asw2-a7-eqiad - T187960
[19:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:59] (PS6) Jcrespo: mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203)
[19:33:01] (CR) jerkins-bot: [V: -1] mariadb: Refactor dump_section.py and rename to match functionality [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/493664 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[19:34:12] (CR) Esanders: "is there a beta cluster group?" [mediawiki-config] - https://gerrit.wikimedia.org/r/493482 (https://phabricator.wikimedia.org/T217365) (owner: Esanders)
[19:35:49] RECOVERY - Check the NTP synchronisation status of timesyncd on notebook1003 is OK: OK: synced at Fri 2019-03-01 19:35:48 UTC.
[19:40:09] !log pre-configure asw-a8 ports on asw2-a8-eqiad - T187960
[19:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:13] T187960: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960
[19:49:23] Operations, ops-eqiad, netops, Patch-For-Review: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (ayounsi) a: Cmjohnson
[20:01:32] Operations, ops-eqiad, netops, Patch-For-Review: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (ayounsi)
[20:13:35] (PS1) Sbisson: Enable and configure the ORES goodfaith model on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/493749 (https://phabricator.wikimedia.org/T211032)
[20:28:28] (PS1) Tpt: Enables maplink for geocoordinate Wikibase statements display on clients [mediawiki-config] - https://gerrit.wikimedia.org/r/493753 (https://phabricator.wikimedia.org/T217442)
[20:41:11] (PS5) Esanders: VE: Enable true section editing for mobile on labs & testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/493482 (https://phabricator.wikimedia.org/T217365)
[20:42:43] (CR) Andrew Bogott: "Is the current plan that each postgres-using project will have its own postgres server? And/or is wikilabels currently the only postgres " (1 comment) [puppet] - https://gerrit.wikimedia.org/r/493608 (https://phabricator.wikimedia.org/T193264) (owner: Bstorm)
[20:45:32] (CR) Bstorm: "> Patch Set 2:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/493608 (https://phabricator.wikimedia.org/T193264) (owner: Bstorm)
[20:46:19] (CR) Bstorm: wikilabels: stage the postgres roles for virtualizing the database (1 comment) [puppet] - https://gerrit.wikimedia.org/r/493608 (https://phabricator.wikimedia.org/T193264) (owner: Bstorm)
[20:47:07] (CR) Andrew Bogott: [C: +1] "ok :)" [puppet] - https://gerrit.wikimedia.org/r/493608 (https://phabricator.wikimedia.org/T193264) (owner: Bstorm)
[20:49:13] (CR) Aaron Schulz: [C: +1] Set expiry headers on thumbnails [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[20:52:59] (PS3) Bstorm: wikilabels: stage the postgres roles for virtualizing the database [puppet] - https://gerrit.wikimedia.org/r/493608 (https://phabricator.wikimedia.org/T193264)
[20:55:55] (CR) Bstorm: [C: +2] wikilabels: stage the postgres roles for virtualizing the database [puppet] - https://gerrit.wikimedia.org/r/493608 (https://phabricator.wikimedia.org/T193264) (owner: Bstorm)
[21:13:28] (PS2) CDanis: partman: grub-install on all RAID{1,10} drives [puppet] - https://gerrit.wikimedia.org/r/490404 (https://phabricator.wikimedia.org/T215183)
[21:15:14] Operations, SRE-Access-Requests: Add bmansurov to archiva-deployers LDAP group - https://phabricator.wikimedia.org/T217447 (bmansurov)
[21:15:24] Operations, SRE-Access-Requests: Add bmansurov to archiva-deployers LDAP group - https://phabricator.wikimedia.org/T217447 (bmansurov)
[21:27:25] (CR) Bartosz Dziewoński: "I think we don't need the $wmgFoo variable if it's just an alias for $wgFoo. Just set 'wgFoo' directly in InitialiseSettings. I struggle t" [mediawiki-config] - https://gerrit.wikimedia.org/r/493482 (https://phabricator.wikimedia.org/T217365) (owner: Esanders)
[21:44:15] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:54:55] Operations, ops-eqiad: Update pc1007,pc1010 status on netbox - https://phabricator.wikimedia.org/T217429 (ayounsi) Mentioning dbproxy1013 here in case it's a similar case https://netbox.wikimedia.org/dcim/devices/1550/
[22:16:27] Operations, ops-eqiad, netops, Patch-For-Review: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (ayounsi)
[22:36:27] (CR) Esanders: "can we fix that as tech debt. I'm just copying the existing style" [mediawiki-config] - https://gerrit.wikimedia.org/r/493482 (https://phabricator.wikimedia.org/T217365) (owner: Esanders)
[22:37:50] Operations, Wikimedia-Logstash, Patch-For-Review, User-herron: Replace and expand Elasticsearch/Kafka storage in eqiad and upgrade the cluster from Debian jessie to stretch - https://phabricator.wikimedia.org/T213898 (herron)
[22:37:58] Operations, Wikimedia-Logstash, Patch-For-Review, User-herron: Replace and expand Elasticsearch/Kafka storage in eqiad and upgrade the cluster from Debian jessie to stretch - https://phabricator.wikimedia.org/T213898 (herron)
[22:48:43] Operations, SRE-Access-Requests: Requesting access to stat1007 for sukhe - https://phabricator.wikimedia.org/T217438 (RobH)
[22:49:46] Operations, SRE-Access-Requests: Requesting access to stat1007 for sukhe - https://phabricator.wikimedia.org/T217438 (RobH) a: Nuria So for this to be approved, we need the approval of the analytics team (since they manage the server.) We'll also need them to tell us exactly what groups to include.
[22:51:13] Operations, LDAP-Access-Requests: Add bmansurov to archiva-deployers LDAP group - https://phabricator.wikimedia.org/T217447 (RobH)
[23:22:02] (PS1) BryanDavis: wmcs: Add profiles for oidentd proxy and client modes [puppet] - https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704)
[23:23:11] (CR) jerkins-bot: [V: -1] wmcs: Add profiles for oidentd proxy and client modes [puppet] - https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: BryanDavis)
[23:23:57] (PS1) Bstorm: osmdb: stage the roles and profiles for virtualizing the servers [puppet] - https://gerrit.wikimedia.org/r/493769 (https://phabricator.wikimedia.org/T193264)
[23:26:10] (PS2) BryanDavis: wmcs: Add profiles for oidentd proxy and client modes [puppet] - https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704)
[23:27:00] (CR) jerkins-bot: [V: -1] wmcs: Add profiles for oidentd proxy and client modes [puppet] - https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: BryanDavis)
[23:27:23] PROBLEM - puppet last run on doc1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:39:30] (PS3) BryanDavis: wmcs: Add profiles for oidentd proxy and client modes [puppet] - https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704)
[23:41:28] (CR) Bartosz Dziewoński: [C: +1] "Hm, yeah, fair enough. I hadn't looked at the file outside of the diff, but it seems like we do the same for other VisualEditor config var" [mediawiki-config] - https://gerrit.wikimedia.org/r/493482 (https://phabricator.wikimedia.org/T217365) (owner: Esanders)
[23:53:17] RECOVERY - puppet last run on doc1001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[23:54:23] (CR) Catrope: [C: +1] Enable and configure the ORES goodfaith model on itwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/493749 (https://phabricator.wikimedia.org/T211032) (owner: Sbisson)