[00:00:04] <jouncebot>	 Deploy window No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190613T0000)
[00:24:48] <wikibugs>	 (03PS1) 10Faidon Liambotis: dsa-check-hpssacli: import latest version from DSA [puppet] - 10https://gerrit.wikimedia.org/r/516724
[00:24:50] <wikibugs>	 (03PS1) 10Faidon Liambotis: dsa-check-hpssacli: speed when checking many disks [puppet] - 10https://gerrit.wikimedia.org/r/516725 (https://phabricator.wikimedia.org/T210723)
[00:24:52] <wikibugs>	 (03PS1) 10Faidon Liambotis: dsa-check-hpssacli: make compatible with ssacli [puppet] - 10https://gerrit.wikimedia.org/r/516726
[00:41:24] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10faidon) So, the timeout patch above bumped the timeouts to 100s I think. On many hosts (e.g. ms-be1036, ms-be103...
[00:43:58] <wikibugs>	 (03PS2) 10Faidon Liambotis: dsa-check-hpssacli: refactor for speed/efficiency [puppet] - 10https://gerrit.wikimedia.org/r/516725 (https://phabricator.wikimedia.org/T210723)
[00:44:00] <wikibugs>	 (03PS2) 10Faidon Liambotis: dsa-check-hpssacli: make compatible with ssacli [puppet] - 10https://gerrit.wikimedia.org/r/516726
[00:45:28] <paravoid>	 !log setting the CPU governor to performance for ms-be1036 (a while ago)
[00:45:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:00] <wikibugs>	 (03PS1) 10Paladox: Merge remote-tracking branch 'upstream/v2.15.14' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/516727
[01:25:17] <icinga-wm>	 PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[01:38:49] <icinga-wm>	 PROBLEM - puppet last run on bast5001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[01:52:27] <icinga-wm>	 RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[02:11:27] <icinga-wm>	 RECOVERY - puppet last run on bast5001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:42:41] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:02:37] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:11:19] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1028 is OK: OK - running: The system is fully operational
[05:16:05] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1042 is OK: OK - running: The system is fully operational
[06:30:35] <icinga-wm>	 PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/update-library.R]
[06:31:59] <icinga-wm>	 PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:34:19] <icinga-wm>	 PROBLEM - puppet last run on mw2285 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:46:10] <wikibugs>	 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10WMDE-Fisch)
[06:50:48] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+1] "This is good with the caveat that we still need a way to prevent unhappy disks from flapping." [puppet] - 10https://gerrit.wikimedia.org/r/516615 (https://phabricator.wikimedia.org/T225613) (owner: 10Filippo Giunchedi)
[06:55:49] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: More traffic db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516738
[06:57:47] <icinga-wm>	 RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:24] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "+1 and also labsdb1010 has caught up with replication :)" [puppet] - 10https://gerrit.wikimedia.org/r/516639 (owner: 10Jcrespo)
[06:58:34] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516738 (owner: 10Marostegui)
[06:59:07] <icinga-wm>	 RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:59:31] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: More traffic db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516738 (owner: 10Marostegui)
[06:59:50] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: More traffic db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516738 (owner: 10Marostegui)
[07:00:35] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1077 after recovering from a crash (duration: 00m 50s)
[07:00:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:27] <icinga-wm>	 RECOVERY - puppet last run on mw2285 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:11:42] <wikibugs>	 10Operations, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Create cookbook to do `nodetool repair` across cassandra cluster - https://phabricator.wikimedia.org/T225694 (10Mathew.onipe)
[07:11:55] <wikibugs>	 10Operations, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Create cookbook to do `nodetool repair` across cassandra cluster - https://phabricator.wikimedia.org/T225694 (10Mathew.onipe) p:05Triage→03Normal
[07:19:11] <wikibugs>	 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Core Platform Team Backlog (Watching / External), and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10mobrovac)
[07:21:05] <wikibugs>	 10Operations, 10ops-codfw: ms-be2018 sdc unreadable sector - https://phabricator.wikimedia.org/T225630 (10ArielGlenn) p:05Triage→03Normal
[07:21:54] <wikibugs>	 10Operations, 10MediaWiki-Releasing, 10Parsoid, 10Release-Engineering-Team: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository <wikitech-l@lists.wikimedia.org> - https://phabricator.wikimedia.org/T225601 (10ArielGlenn) p:05Triage→03High
[07:22:31] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: gmail users being suspended from mediawiki-l due to excessive bounces - https://phabricator.wikimedia.org/T225553 (10ArielGlenn) p:05Triage→03Normal
[07:23:32] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Verify that all mailman mailing lists have private_roster=2 - https://phabricator.wikimedia.org/T225269 (10ArielGlenn) p:05Triage→03Normal
[07:23:47] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on es2003 - https://phabricator.wikimedia.org/T225131 (10ArielGlenn) p:05Triage→03Normal
[07:25:38] <wikibugs>	 10Operations, 10Traffic, 10HTTPS: en.wikipedia.com [sic] serves an invalid certificate - https://phabricator.wikimedia.org/T214253 (10ArielGlenn)
[07:25:40] <wikibugs>	 10Operations: wikipedia.com has invalid certificate - https://phabricator.wikimedia.org/T225650 (10ArielGlenn)
[07:27:33] <wikibugs>	 10Operations, 10serviceops, 10Service-deployment-requests, 10Services (watching): Internal deployment of open_nsfw-- image scoring service - https://phabricator.wikimedia.org/T225664 (10mobrovac)
[07:32:18] <wikibugs>	 10Operations, 10serviceops, 10Service-deployment-requests, 10Services (watching): Internal deployment of open_nsfw-- image scoring service - https://phabricator.wikimedia.org/T225664 (10ArielGlenn) p:05Triage→03Normal
[07:41:10] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on es2003 - https://phabricator.wikimedia.org/T225131 (10Marostegui)
[07:45:31] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests: Request: add awight to contint-docker - https://phabricator.wikimedia.org/T223262 (10awight) 05Open→03Declined I'm having second thoughts about this request, because I'm no longer see that I'l...
[07:48:31] <wikibugs>	 10Operations, 10SRE-Access-Requests: Typo in workboard column name: "Confirmation" - https://phabricator.wikimedia.org/T225696 (10awight)
[07:52:06] <wikibugs>	 (03PS2) 10Awight: New configuration to pull from Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514715 (https://phabricator.wikimedia.org/T224007)
[07:55:49] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10ayounsi) >>! In T224603#5243200, @Papaul wrote: > @ayounsi I am planning on installing those new servers in row c and row D and I don't have the "interface-range ganeti...
[08:05:04] <wikibugs>	 (03PS3) 10Awight: New configuration to pull sitelinks from Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514715 (https://phabricator.wikimedia.org/T224007)
[08:05:32] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] labsdb: Move labsdb1010 from analytics to web to ease the extra load [puppet] - 10https://gerrit.wikimedia.org/r/516639 (owner: 10Jcrespo)
[08:09:24] <jynus>	 !log reloading proxies for wikireplicas to rebalance load
[08:09:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:51] <wikibugs>	 (03CR) 10WMDE-Fisch: [C: 03+1] New configuration to pull sitelinks from Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514715 (https://phabricator.wikimedia.org/T224007) (owner: 10Awight)
[08:36:53] <wikibugs>	 (03PS7) 10Mathew.onipe: Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072)
[08:37:23] <wikibugs>	 (03CR) 10Mathew.onipe: Add maps reboot cookbook (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe)
[08:38:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe)
[08:43:02] <wikibugs>	 10Operations, 10SRE-Access-Requests: Typo in workboard column name: "Confirmation" - https://phabricator.wikimedia.org/T225696 (10Aklapper) Thanks. Meh, I cannot edit that column because "Members of the project "acl*sre-team" can take this action."...
[08:44:54] <wikibugs>	 (03CR) 10Mathew.onipe: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe)
[08:46:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe)
[08:50:19] <wikibugs>	 10Operations, 10MediaWiki-Releasing, 10Parsoid, 10Release-Engineering-Team: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository <wikitech-l@lists.wikimedia.org> - https://phabricator.wikimedia.org/T225601 (10fgiunchedi) a:03fgiunchedi I'll be looking into renewing this key
[08:50:39] <wikibugs>	 (03PS1) 10Jcrespo: labsdb: Setup labsdb1010 as a web wikireplica [puppet] - 10https://gerrit.wikimedia.org/r/516749
[08:55:16] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] labsdb: Setup labsdb1010 as a web wikireplica [puppet] - 10https://gerrit.wikimedia.org/r/516749 (owner: 10Jcrespo)
[08:55:18] <wikibugs>	 (03PS1) 10DCausse: [cirrus] Use correct factory declaration for EntityFullTextQueryBuilder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516750 (https://phabricator.wikimedia.org/T216429)
[08:56:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Use correct factory declaration for EntityFullTextQueryBuilder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516750 (https://phabricator.wikimedia.org/T216429) (owner: 10DCausse)
[08:56:47] <wikibugs>	 (03CR) 10Mathew.onipe: "pylint is failing to run causing build to fail." [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe)
[08:57:00] <onimisionipe>	 volans ^
[08:57:06] <onimisionipe>	 can you take a look pls
[08:57:58] <wikibugs>	 (03PS2) 10DCausse: [cirrus] Use correct factory declaration for EntityFullTextQueryBuilder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516750 (https://phabricator.wikimedia.org/T216429)
[08:58:04] <volans>	 onimisionipe: sure, I'll try sometime today between sessions
[08:58:15] <onimisionipe>	 Ok. thanks!
[09:10:31] <icinga-wm>	 PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman
[09:11:12] <wikibugs>	 (03PS1) 10Filippo Giunchedi: releases: update expired gpg key [puppet] - 10https://gerrit.wikimedia.org/r/516752 (https://phabricator.wikimedia.org/T225601)
[09:11:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] releases: update expired gpg key [puppet] - 10https://gerrit.wikimedia.org/r/516752 (https://phabricator.wikimedia.org/T225601) (owner: 10Filippo Giunchedi)
[09:15:25] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Looks like CI is looking into the file itself to check for python shebang, but doesn't like binary files (rake's setup_python_extensions)." [puppet] - 10https://gerrit.wikimedia.org/r/516752 (https://phabricator.wikimedia.org/T225601) (owner: 10Filippo Giunchedi)
[09:23:31] <icinga-wm>	 RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. https://wikitech.wikimedia.org/wiki/Mailman
[09:28:16] <wikibugs>	 (03PS10) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291)
[09:32:03] <wikibugs>	 (03PS2) 10Filippo Giunchedi: releases: update expired gpg key [puppet] - 10https://gerrit.wikimedia.org/r/516752 (https://phabricator.wikimedia.org/T225601)
[09:32:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] releases: update expired gpg key [puppet] - 10https://gerrit.wikimedia.org/r/516752 (https://phabricator.wikimedia.org/T225601) (owner: 10Filippo Giunchedi)
[09:33:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] releases: update expired gpg key [puppet] - 10https://gerrit.wikimedia.org/r/516752 (https://phabricator.wikimedia.org/T225601) (owner: 10Filippo Giunchedi)
[09:37:49] <wikibugs>	 (03PS1) 10Matthias Mullie: Consistent beta wikidata urls, without www [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516753
[09:42:36] <wikibugs>	 (03CR) 10Reedy: [C: 04-1] "Tentative CR-1" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516753 (owner: 10Matthias Mullie)
[09:42:59] <wikibugs>	 10Operations, 10MediaWiki-Releasing, 10Parsoid, 10Release-Engineering-Team, 10Patch-For-Review: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository <wikitech-l@lists.wikimedia.org> - https://phabricator.wikimedia.org/T225601 (10fgiunchedi) Ok this should be done now, the new...
[09:47:58] <wikibugs>	 (03PS11) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291)
[09:48:40] <wikibugs>	 10Operations, 10MediaWiki-Releasing, 10Parsoid, 10Release-Engineering-Team, 10Patch-For-Review: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository <wikitech-l@lists.wikimedia.org> - https://phabricator.wikimedia.org/T225601 (10fgiunchedi) Instructions at https://wikitech.wi...
[10:02:47] <wikibugs>	 10Operations, 10Cloud-Services, 10wikitech.wikimedia.org: Investigate issues with wikitech-static.wikimedia.org - https://phabricator.wikimedia.org/T156570 (10ArielGlenn) 05Open→03Resolved a:03ArielGlenn The search listed as the second issue now works fine.  The google result listed as the first issue...
[10:05:03] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10RobH) p:05Triage→03Normal
[10:05:15] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10RobH)
[10:06:27] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10RobH)
[10:06:30] <wikibugs>	 10Operations, 10wikitech.wikimedia.org: wikitech-static cert renewal seems to stop apache2 - https://phabricator.wikimedia.org/T214640 (10ArielGlenn) This has since been set to standalone, and new certs were generated. See T204840#5243222 for the context. Should this task remain open?
[10:13:03] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] dsa-check-hpssacli: import latest version from DSA [puppet] - 10https://gerrit.wikimedia.org/r/516724 (owner: 10Faidon Liambotis)
[10:14:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] dsa-check-hpssacli: make compatible with ssacli [puppet] - 10https://gerrit.wikimedia.org/r/516726 (owner: 10Faidon Liambotis)
[10:18:16] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Marostegui)
[10:19:03] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Marostegui)
[10:26:19] <wikibugs>	 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10SCB, 10Services (watching): Upgrade python-service-checker across the fleet - https://phabricator.wikimedia.org/T225707 (10mobrovac)
[10:29:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Very nice! LGTM from a perl-untrained eye. Another good target for testing I think would be WMCS boxes and DBs which have different raid c" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/516725 (https://phabricator.wikimedia.org/T210723) (owner: 10Faidon Liambotis)
[10:31:00] <wikibugs>	 (03PS1) 10Marostegui: install_server: Allow installation of new dbproxy [puppet] - 10https://gerrit.wikimedia.org/r/516758 (https://phabricator.wikimedia.org/T225704)
[10:36:16] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Marostegui) I updated the task description, but for the record:  Racking Proposal: Install one per row. If possible, avoid installing them in the same rack of...
[10:41:54] <wikibugs>	 10Operations, 10Operations-Software-Development: Error while checking binary files for python shebang - https://phabricator.wikimedia.org/T225710 (10fgiunchedi)
[10:43:04] <wikibugs>	 (03PS1) 10Cparle: Add 'sms' and 'smn' langcodes to commons for use in captions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516760 (https://phabricator.wikimedia.org/T222309)
[10:46:07] <wikibugs>	 10Operations, 10Traffic, 10Core Platform Team Backlog (Designing), 10MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), and 6 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10mobrovac)
[10:47:46] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-au-members and wikimedia-au-announce password reset - https://phabricator.wikimedia.org/T225712 (10StevenCrossin)
[10:53:58] <wikibugs>	 (03CR) 10Matthias Mullie: [C: 03+1] "LGTM, will deploy next week" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516760 (https://phabricator.wikimedia.org/T222309) (owner: 10Cparle)
[10:58:29] <icinga-wm>	 PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[11:01:21] <wikibugs>	 10Operations, 10media-storage: CPU scaling governor on ms-be hosts - https://phabricator.wikimedia.org/T225713 (10fgiunchedi)
[11:02:39] <wikibugs>	 10Operations, 10media-storage: CPU scaling governor on ms-be hosts - https://phabricator.wikimedia.org/T225713 (10fgiunchedi)
[11:03:42] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] "Looks good in general, I couldn't find a related commit allowing the load balancers to reach the configured ports in cloudelastic[1001-100" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) (owner: 10EBernhardson)
[11:22:04] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-au-members and wikimedia-au-announce password reset - https://phabricator.wikimedia.org/T225712 (10StevenCrossin) Never mind this has been sorted out on our end
[11:25:37] <icinga-wm>	 RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[11:29:26] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-au-members and wikimedia-au-announce password reset - https://phabricator.wikimedia.org/T225712 (10Aklapper) @StevenCrossin: If there is nothing to do, feel free to set the status of this report to "Declined" via the {nav name=Add Action... > Change Status} dr...
[11:30:19] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-au-members and wikimedia-au-announce password reset - https://phabricator.wikimedia.org/T225712 (10StevenCrossin) 05Open→03Declined Closed as sorted
[11:37:39] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:38:03] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[11:38:15] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[11:38:27] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:40:33] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:41:21] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:45:00] <wikibugs>	 10Operations, 10MediaWiki-Releasing, 10Parsoid, 10Release-Engineering-Team: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository <wikitech-l@lists.wikimedia.org> - https://phabricator.wikimedia.org/T225601 (10Tkshamburg) Hi @fgiunchedi ,  thanks for creating the new key (now 10...
[11:46:47] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[11:46:59] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
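Note on the 5xx alerts above: the reported figures (22.22% and 11.11%) are consistent with two and one datapoints out of nine sitting above the critical threshold of 1000.0. The sketch below assumes the check reports the share of datapoints in its window that exceed the threshold. The function name and the sample values are hypothetical and are not taken from the actual check or from real traffic data.

```python
def percent_above(datapoints, threshold):
    """Return the percentage of non-null datapoints strictly above threshold."""
    valid = [d for d in datapoints if d is not None]
    if not valid:
        return 0.0
    above = sum(1 for d in valid if d > threshold)
    return 100.0 * above / len(valid)


# Hypothetical window of nine per-minute 5xx counts (made-up values).
text_window = [120, 90, 300, 1500, 2200, 400, 150, 80, 60]
esams_window = [100, 80, 200, 1300, 700, 300, 120, 70, 50]

print(f"{percent_above(text_window, 1000.0):.2f}%")   # two of nine above
print(f"{percent_above(esams_window, 1000.0):.2f}%")  # one of nine above
```

Under this assumption a short spike can trip the alert even when most of the window is healthy, which would match the quick recoveries logged above.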
[12:21:57] <wikibugs>	 10Operations, 10media-storage: CPU scaling governor on HP Gen9 hosts - https://phabricator.wikimedia.org/T225713 (10faidon)
[12:22:57] <wikibugs>	 10Operations, 10MediaWiki-Releasing, 10Parsoid, 10Release-Engineering-Team: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository <wikitech-l@lists.wikimedia.org> - https://phabricator.wikimedia.org/T225601 (10fgiunchedi) >>! In T225601#5256399, @Tkshamburg wrote: > Hi @fgiunche...
[12:32:38] <wikibugs>	 10Operations, 10MediaWiki-Releasing, 10Parsoid, 10Release-Engineering-Team: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository <wikitech-l@lists.wikimedia.org> - https://phabricator.wikimedia.org/T225601 (10Tkshamburg) Everything is fine now, "apt update" shows no errors now....
[12:54:45] <wikibugs>	 10Operations, 10DC-Ops, 10Traffic: poll power data for redeployment of esams/knams - https://phabricator.wikimedia.org/T225720 (10RobH) p:05Triage→03Normal
[12:56:25] <wikibugs>	 10Operations, 10DC-Ops, 10Traffic: poll power data for redeployment of esams/knams - https://phabricator.wikimedia.org/T225720 (10RobH) ` 5 $> ssh cr2-esams.wikimedia.org --- JUNOS 13.3R8.7 built 2015-10-23 21:23:16 UTC {master} robh@re0.cr2-esams> show power                          ^ syntax error, expectin...
[13:09:57] <wikibugs>	 10Operations, 10DC-Ops, 10Traffic: poll power data for redeployment of esams/knams - https://phabricator.wikimedia.org/T225720 (10RobH)
[13:10:54] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10jcrespo) Wait, some of these will go to the cloud racks, that needs planning!
[13:11:33] <wikibugs>	 10Operations, 10DC-Ops, 10Traffic: poll power data for redeployment of esams/knams - https://phabricator.wikimedia.org/T225720 (10RobH) My understanding is we won't be using any MX80s when this is all done, so I did not pull that info.  I'm not sure of the peak usage hours for each site, or if there is a jui...
[13:17:45] <wikibugs>	 10Operations, 10DC-Ops, 10Traffic: poll power data for redeployment of esams/knams - https://phabricator.wikimedia.org/T225720 (10ayounsi) I don't think Junos have that feature.  You can find the peak time of a device using their "overall traffic" graph in LibreNMS (eg. https://librenms.wikimedia.org/device/...
[13:18:17] <wikibugs>	 10Operations, 10MediaWiki-Releasing, 10Parsoid, 10Release-Engineering-Team: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository <wikitech-l@lists.wikimedia.org> - https://phabricator.wikimedia.org/T225601 (10fgiunchedi) 05Open→03Resolved No problem @Tkshamburg ! Thanks for...
[13:33:06] <wikibugs>	 10Operations, 10media-storage, 10observability: swift-drive-audit unmounting a drive doesn't produce any alerts or notifications - https://phabricator.wikimedia.org/T222362 (10fgiunchedi)
[13:33:08] <wikibugs>	 10Operations, 10ops-codfw, 10media-storage, 10observability, 10User-fgiunchedi: ms-be2043 'sdd' throwing lots of errors - https://phabricator.wikimedia.org/T222654 (10fgiunchedi) 05Open→03Resolved All done, resolving.
[13:49:54] <wikibugs>	 10Operations, 10Discovery, 10Discovery-Analysis, 10Product-Analytics, and 3 others: Setup a mirror for R language dependencies (CRAN) - https://phabricator.wikimedia.org/T170995 (10hashar) 05Open→03Declined maybe one day if we look again at R
[13:51:52] <wikibugs>	 (03CR) 10Bearloga: [C: 03+1] "@Gehel: Deb is okay with sunsetting the Portal stuff so we can proceed with this patch" [puppet] - 10https://gerrit.wikimedia.org/r/504577 (https://phabricator.wikimedia.org/T197138) (owner: 10Bearloga)
[13:54:25] <wikibugs>	 (03PS2) 10Matthias Mullie: Consistent beta wikidata urls, without www [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516753
[13:55:38] * debt with a sentimental sigh, yes on portal patch
[14:14:38] <wikibugs>	 (03PS2) 10Gehel: profile::discovery_dashboards: remove Wikipedia Portal dashboard [puppet] - 10https://gerrit.wikimedia.org/r/504577 (https://phabricator.wikimedia.org/T197138) (owner: 10Bearloga)
[14:15:49] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] profile::discovery_dashboards: remove Wikipedia Portal dashboard [puppet] - 10https://gerrit.wikimedia.org/r/504577 (https://phabricator.wikimedia.org/T197138) (owner: 10Bearloga)
[14:18:39] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 48.17, 27.93, 19.47
[14:19:54] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Marostegui) Very good point!
[14:20:37] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 50.39, 27.41, 17.94
[14:21:11] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 66.85, 35.21, 21.32
[14:21:26] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Marostegui)
[14:22:49] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 87.18, 45.37, 27.49
[14:23:29] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 68.32, 36.81, 21.43
[14:24:31] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 52.33, 29.94, 19.13
[14:27:16] <godog>	 looks like api hosts, I'm taking a look at e.g. mw1222
[14:28:57] * apergos peeks in
[14:29:50] <godog>	 load is going back down but I don't know what caused load on api
[14:30:15] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 11.94, 23.34, 20.63
[14:30:41] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 13.22, 23.45, 22.29
[14:35:58] <apergos>	 I was looking in logstash for mw1233 and didn't see anything that jumped out as to the number or type of requests really
[14:38:41] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 8.66, 14.86, 23.39
[14:39:23] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 7.88, 14.92, 23.89
[14:41:47] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 6.91, 12.69, 23.16
[14:42:49] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 6.94, 12.44, 23.70
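Note on the load alerts above: the three numbers in each message are the usual 1, 5 and 15 minute load averages, so comparing them shows whether load was still climbing or already falling when the check ran. A minimal sketch, assuming only the "load average: a, b, c" text format seen in these messages. The function name and the rising/falling/mixed labels are illustrative and not part of Icinga.

```python
import re

# Matches the "load average: 48.17, 27.93, 19.47" fragment of an alert line.
LOAD_RE = re.compile(r"load average: ([\d.]+), ([\d.]+), ([\d.]+)")


def load_trend(message):
    """Classify the trend from the 1, 5 and 15 minute load averages."""
    match = LOAD_RE.search(message)
    if match is None:
        raise ValueError("no load average found in message")
    one, five, fifteen = (float(x) for x in match.groups())
    if one > five > fifteen:
        return "rising"
    if one < five < fifteen:
        return "falling"
    return "mixed"


print(load_trend("CRITICAL - load average: 48.17, 27.93, 19.47"))
print(load_trend("OK - load average: 6.91, 12.69, 23.16"))
print(load_trend("OK - load average: 11.94, 23.34, 20.63"))
```

The first message reads as rising, the second as falling, and the third as mixed, since its 5 minute value is still above the 15 minute one.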
[15:23:37] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 63.70, 37.80, 23.12
[15:25:50] <wikibugs>	 (03PS1) 10Filippo Giunchedi: profile: fix swift symlink for WMCS LVs [puppet] - 10https://gerrit.wikimedia.org/r/516791
[15:26:47] <apergos>	 scribunto whines in hhvm on that box
[15:28:01] <apergos>	  \nFatal error: entire web request took longer than 200 seconds and timed out in /srv/mediawiki/php-1.34.0-wmf.8/extensions/Scribunto/includes/engines/LuaSandbox/Engine.php on line 282
[15:28:22] <wikibugs>	 10Operations, 10MediaWiki-Releasing, 10Parsoid: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository <wikitech-l@lists.wikimedia.org> - https://phabricator.wikimedia.org/T225601 (10greg)
[15:28:41] <apergos>	 but load already dropping since then
[15:28:52] <apergos>	 back down to 21 now
[15:30:51] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 19.31, 24.31, 22.12
[15:31:10] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[15:31:42] <godog>	 uh oh
[15:32:04] <apergos>	 yeah
[15:32:32] <godog>	 cpu climbing on wdqs https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=wdqs&var-instance=All
[15:32:55] <apergos>	 gehel, onimisionipe?
[15:33:15] <gehel>	 onimisionipe: can you look?
[15:33:34] <godog>	 yeah looks like cpu is pretty much jammed
[15:35:06] <gehel>	 for once, it does not seem related to edit load
[15:35:39] <godog>	 yup, seeing requests being banned
[15:36:54] <gehel>	 !log restarting blazegraph on wdqs-internal / eqiad (just in case)
[15:36:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:00] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[15:42:48] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[15:43:20] <wikibugs>	 (PS1) CDanis: varnish text FE: ban python-requests User-Agent on WDQS [puppet] - https://gerrit.wikimedia.org/r/516793
[15:45:15] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] varnish text FE: ban python-requests User-Agent on WDQS [puppet] - https://gerrit.wikimedia.org/r/516793 (owner: CDanis)
[15:47:56] <wikibugs>	 (CR) Gehel: [C: +1] "+1 as a temporary measure" [puppet] - https://gerrit.wikimedia.org/r/516793 (owner: CDanis)
[15:50:10] <_joe_>	 I'd prefer if we do that at the nginx level
[15:50:10] <wikibugs>	 (PS1) CDanis: wdqs: ban disallowed User-Agent at nginx [puppet] - https://gerrit.wikimedia.org/r/516794
[15:50:15] <_joe_>	 ^^
[15:50:18] <cdanis>	 gehel: please look at ^ instead
[15:50:40] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] wdqs: ban disallowed User-Agent at nginx [puppet] - https://gerrit.wikimedia.org/r/516794 (owner: CDanis)
[15:50:42] <wikibugs>	 (Abandoned) CDanis: varnish text FE: ban python-requests User-Agent on WDQS [puppet] - https://gerrit.wikimedia.org/r/516793 (owner: CDanis)
[15:56:35] <onimisionipe>	 wow
[15:56:42] <onimisionipe>	 I'm late to the party
[15:56:48] * onimisionipe is reading backlog
[15:58:32] <gehel>	 !log restart blazegraph on wdqs public cluster
[15:58:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:18] <gehel>	 !log restart blazegraph on wdqs public cluster completed
[16:02:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:51] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 59.18, 28.59, 18.47
[16:27:03] <apergos>	 200 seconds seems like a long time to allow for a web request
[16:27:25] <anarcat>	 hello!
[16:27:40] <anarcat>	 is there a cumin release planned soon? i would love to not have to patch cumin on install :)
[16:27:43] <apergos>	 /srv/mediawiki/php-1.34.0-wmf.8/extensions/Scribunto/includes/engines/LuaSandbox/Engine.php  whines again 
[16:28:51] <apergos>	 I don't know the release schedule, and the folks who could answer that aren't around right now 
[16:30:25] <vgutierrez>	 volans: ^^
[16:30:49] <vgutierrez>	 anarcat: AFAIK volans is the best one to answer that
[16:31:19] <anarcat>	 yeah, that's what i figured as well
[16:31:35] <apergos>	 I don't see anything right away in phabricator, though I imagine you already looked there
[16:34:55] <apergos>	 https://doc.wikimedia.org/cumin/master/release.html  seems to be due for one :-)
[16:36:35] <anarcat>	 i did not, actually - didn't know where to look
[16:36:42] <anarcat>	 yeah, i looked on pypi and it seemed we're overdue
[16:37:21] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 12.74, 18.22, 23.49
[16:38:37] <apergos>	 https://phabricator.wikimedia.org/search/query/EVg.YagYZYue/#R    or maybe (if you dare) https://phabricator.wikimedia.org/tag/operations-software-development/ but there's other stuff mixed in on the workboard
[16:40:04] * anarcat dares
[16:40:18] <anarcat>	 undare! undare! undare!
[16:40:20] <anarcat>	 ;)
[16:40:41] <apergos>	 :-D :-D
[16:48:57] <wikibugs>	 (PS1) Vgutierrez: varnish: Rate limit wdqs requests violating UA policy [puppet] - https://gerrit.wikimedia.org/r/516803
[16:58:04] <wikibugs>	 (CR) Ema: [C: +1] varnish: Rate limit wdqs requests violating UA policy [puppet] - https://gerrit.wikimedia.org/r/516803 (owner: Vgutierrez)
[16:58:19] <wikibugs>	 (CR) Vgutierrez: [C: +2] varnish: Rate limit wdqs requests violating UA policy [puppet] - https://gerrit.wikimedia.org/r/516803 (owner: Vgutierrez)
[17:34:48] <bstorm_>	 !log T203254 set cpu scaling governor to performance on labstore1004 and labstore1005
[17:34:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:53] <stashbot>	 T203254: labstore1004 and labstore1005 high load issues following upgrades - https://phabricator.wikimedia.org/T203254
[17:44:16] <wikibugs>	 (CR) EBernhardson: "Looking into how to allow the load balancers to reach the configured ports, it seems that is going to be our profile::elasticsearch::cirru" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) (owner: EBernhardson)
[17:44:53] <logmsgbot>	 !log fdans@deploy1001 Started deploy [analytics/refinery@67b34fe]: deploying refinery source 0.0.92 into refinery
[17:44:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:35] <icinga-wm>	 PROBLEM - SSH on proton1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:56:19] <icinga-wm>	 RECOVERY - SSH on proton1001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:57:51] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:58:33] <icinga-wm>	 PROBLEM - Disk space on notebook1003 is CRITICAL: DISK CRITICAL - free space: /srv 2031 MB (1% inode=86%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[18:01:38] <logmsgbot>	 !log fdans@deploy1001 Finished deploy [analytics/refinery@67b34fe]: deploying refinery source 0.0.92 into refinery (duration: 16m 45s)
[18:01:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:11] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:05:47] <icinga-wm>	 RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[18:10:24] <logmsgbot>	 !log fdans@deploy1001 Started deploy [analytics/refinery@67b34fe]: retrying deployment of analytics refinery
[18:10:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:43] <logmsgbot>	 !log fdans@deploy1001 Finished deploy [analytics/refinery@67b34fe]: retrying deployment of analytics refinery (duration: 00m 19s)
[18:10:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:45] <wikibugs>	 (PS5) EBernhardson: LVS for cloudelastic [puppet] - https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324)
[19:06:53] <icinga-wm>	 PROBLEM - proton endpoints health on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[19:06:59] <icinga-wm>	 PROBLEM - dhclient process on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused
[19:06:59] <icinga-wm>	 PROBLEM - Check size of conntrack table on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[19:07:05] <icinga-wm>	 PROBLEM - Disk space on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[19:07:09] <icinga-wm>	 PROBLEM - Check systemd state on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused
[19:07:13] <icinga-wm>	 PROBLEM - configured eth on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused
[19:07:25] <icinga-wm>	 PROBLEM - DPKG on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused
[19:07:45] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:08:33] <icinga-wm>	 PROBLEM - puppet last run on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused
[19:29:05] <icinga-wm>	 RECOVERY - DPKG on proton1001 is OK: All packages OK
[19:29:25] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on proton1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:30:03] <icinga-wm>	 RECOVERY - dhclient process on proton1001 is OK: PROCS OK: 0 processes with command name dhclient
[19:30:05] <icinga-wm>	 RECOVERY - Check size of conntrack table on proton1001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[19:30:11] <icinga-wm>	 RECOVERY - Disk space on proton1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[19:30:15] <icinga-wm>	 RECOVERY - Check systemd state on proton1001 is OK: OK - running: The system is fully operational
[19:30:17] <icinga-wm>	 RECOVERY - puppet last run on proton1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:30:19] <icinga-wm>	 RECOVERY - configured eth on proton1001 is OK: OK - interfaces up
[19:30:54] <wikibugs>	 (PS1) Gehel: wdqs: limit number of messages from the same logger also for file logging. [puppet] - https://gerrit.wikimedia.org/r/516837
[19:59:49] <icinga-wm>	 PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[20:00:24] <wikibugs>	 Operations, MediaWiki-Releasing, Parsoid: debian signing keyid E84AFDD2 has expired - https://phabricator.wikimedia.org/T141400 (Kghbln)
[20:27:01] <icinga-wm>	 RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:02:54] <wikibugs>	 Operations, ops-codfw, netops: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (Papaul)
[21:08:04] <SMalyshev>	 I understand everybody is on SRE offsite, but maybe somebody can create next week for https://wikitech.wikimedia.org/wiki/Deployments ?
[21:13:32] <apergos>	 releng will do that I believe
[21:14:40] <SMalyshev>	 hope so... usually it appears a couple of days in advance, but now there's nothing
[21:14:43] * apergos off for real this time (midnight)
[21:34:38] <wikibugs>	 (PS1) Smalyshev: Also ban empty user agents [puppet] - https://gerrit.wikimedia.org/r/516959
[22:18:24] <wikibugs>	 (CR) Smalyshev: [C: +1] wdqs: limit number of messages from the same logger also for file logging. [puppet] - https://gerrit.wikimedia.org/r/516837 (owner: Gehel)
[22:18:56] <wikibugs>	 (CR) Smalyshev: [C: +1] "Given that we're counting the events in metrics, repeated logging messages are not much useful." [puppet] - https://gerrit.wikimedia.org/r/516837 (owner: Gehel)
[22:26:30] <wikibugs>	 (CR) Smalyshev: [cirrus] Use correct factory declaration for EntityFullTextQueryBuilder (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/516750 (https://phabricator.wikimedia.org/T216429) (owner: DCausse)
[22:29:01] <wikibugs>	 (PS1) Bstorm: cloudstore: re-enable diamond collectors for monitoring [puppet] - https://gerrit.wikimedia.org/r/516967 (https://phabricator.wikimedia.org/T225265)
[22:40:43] <greg-g>	 SMalyshev: I'll get to it in a bit, been a bit crushed with things lately
[22:40:56] <SMalyshev>	 greg-g: thanks!
[22:48:23] <greg-g>	 SMalyshev: done: https://wikitech.wikimedia.org/wiki/Deployments#Week_of_June_17th
[22:58:00] <wikibugs>	 (CR) Bstorm: [C: +2] cloudstore: re-enable diamond collectors for monitoring [puppet] - https://gerrit.wikimedia.org/r/516967 (https://phabricator.wikimedia.org/T225265) (owner: Bstorm)
[23:25:39] <SMalyshev>	 !log depooled wdqs1006 to let it catch up quicker
[23:25:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:06] <wikibugs>	 (PS1) Tim Starling: Add a fatal error page to go with the proposed wmerrors feature [mediawiki-config] - https://gerrit.wikimedia.org/r/516975 (https://phabricator.wikimedia.org/T187147)
[23:32:02] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add a fatal error page to go with the proposed wmerrors feature [mediawiki-config] - https://gerrit.wikimedia.org/r/516975 (https://phabricator.wikimedia.org/T187147) (owner: Tim Starling)
[23:40:24] <wikibugs>	 (PS2) Tim Starling: Add a fatal error page to go with the proposed wmerrors feature [mediawiki-config] - https://gerrit.wikimedia.org/r/516975 (https://phabricator.wikimedia.org/T187147)
[23:41:49] <wikibugs>	 Operations, Release-Engineering-Team, Scap: Enable scap to roll back broken changes to MediaWiki - https://phabricator.wikimedia.org/T225207 (greg)
[23:51:46] <wikibugs>	 (CR) Krinkle: "I've moved the hhvm equivalent to puppet recently, in prep for making it share the error-page.erb template. Might make sense to do this fo" [mediawiki-config] - https://gerrit.wikimedia.org/r/516975 (https://phabricator.wikimedia.org/T187147) (owner: Tim Starling)
[23:54:17] <wikibugs>	 (CR) Krinkle: Add a fatal error page to go with the proposed wmerrors feature (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/516975 (https://phabricator.wikimedia.org/T187147) (owner: Tim Starling)
[23:57:17] <wikibugs>	 (CR) Smalyshev: [C: +1] [cirrus] Use correct factory declaration for EntityFullTextQueryBuilder (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/516750 (https://phabricator.wikimedia.org/T216429) (owner: DCausse)