[00:01:51] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Mails through deployment-mx SPF & DKIM fails - https://phabricator.wikimedia.org/T87338#4274226 (10Krenair) Gmail is now showing, with that cherry-picked: SPF: PASS with IP 208.80.155.138 Learn more DKIM: 'PASS' with domain beta.wmflabs.org Lea...
[00:18:03] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560#4274234 (10Papaul)
[00:35:12] <legoktm>	 !log remove non-deployers from wmf-deployment Gerrit group (T196959)
[00:35:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:40:11] <wikibugs>	 (03PS2) 10Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783
[00:41:00] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 (owner: 10Paladox)
[00:42:17] <wikibugs>	 (03PS1) 10Papaul: DNS: Add mgmt & production DNS entries for lvs200[7-10] [dns] - 10https://gerrit.wikimedia.org/r/439803 (https://phabricator.wikimedia.org/T196560)
[00:45:12] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560#4274283 (10Papaul)
[00:45:50] <wikibugs>	 (03PS3) 10Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783
[00:46:10] <wikibugs>	 (03PS4) 10Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783
[00:46:49] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 (owner: 10Paladox)
[00:47:17] <wikibugs>	 (03PS5) 10Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783
[00:49:16] <wikibugs>	 (03CR) 10Paladox: "This will be used to add http://gerrit.wmfusercontent.org in a seperate commit which will then be used to supply avatars in gerrit." [puppet] - 10https://gerrit.wikimedia.org/r/439783 (owner: 10Paladox)
[00:49:48] <wikibugs>	 (03PS1) 10Paladox: Gerrit: Hook up gerrit.wmfusercontent.org to alias [puppet] - 10https://gerrit.wikimedia.org/r/439808
[00:50:16] <wikibugs>	 (03PS2) 10Paladox: Gerrit: Hook up gerrit.wmfusercontent.org to alias [puppet] - 10https://gerrit.wikimedia.org/r/439808
[00:50:31] <wikibugs>	 (03CR) 10Paladox: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/439808 (owner: 10Paladox)
[00:54:44] <icinga-wm>	 PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received
[00:56:55] <icinga-wm>	 RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy
[01:54:02] <wikibugs>	 (03PS1) 10Papaul: DHCP: Add MAC address and netboot entries for backup2001 [puppet] - 10https://gerrit.wikimedia.org/r/439830 (https://phabricator.wikimedia.org/T196477)
[02:02:52] <wikibugs>	 (03PS2) 10Papaul: DNS: Add mgmt DNS entries for bast2002 (supposed to be in public VLAN) [dns] - 10https://gerrit.wikimedia.org/r/439786 (https://phabricator.wikimedia.org/T196665)
[02:35:35] <logmsgbot>	 !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.7) (duration: 14m 10s)
[02:35:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:45:53] <logmsgbot>	 !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Tue Jun 12 02:45:53 UTC 2018 (duration 10m 18s)
[02:45:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:59:13] <wikibugs>	 (03PS8) 10KartikMistry: apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/433318 (https://phabricator.wikimedia.org/T194342)
[04:00:15] <wikibugs>	 (03PS2) 10KartikMistry: Update apertium-apy initscripts [puppet] - 10https://gerrit.wikimedia.org/r/438135 (https://phabricator.wikimedia.org/T194342)
[05:01:05] <icinga-wm>	 PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:04:05] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439835 (https://phabricator.wikimedia.org/T191316)
[05:04:25] <icinga-wm>	 RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[05:06:41] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439835 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui)
[05:08:12] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439835 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui)
[05:08:29] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439835 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui)
[05:08:56] <wikibugs>	 10Operations, 10DBA, 10Gerrit, 10Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4274454 (10Marostegui) Not sure if the above actions by @mmodell should have shown any changes on the write patterns, but so far, they remain the same  https://grafana.wikimedi...
[05:09:30] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1091 for alter table (duration: 00m 52s)
[05:09:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:41] <marostegui>	 !log Deploy schema change on db1091 T191316 T192926 T89737 T195193
[05:09:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:48] <stashbot>	 T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[05:09:48] <stashbot>	 T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[05:09:48] <stashbot>	 T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[05:09:48] <stashbot>	 T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[05:19:25] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[05:22:45] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:46:53] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881#4274466 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` mw1230.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/20180612054...
[05:46:56] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881#4274467 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1230.eqiad.wmnet'] ```  Of which those **FAILED**: ``` ['mw1230.eqiad.wmnet'] ```
[05:47:58] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881#4274468 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` mw1230.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/20180612054...
[05:48:00] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881#4274469 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1230.eqiad.wmnet'] ```  Of which those **FAILED**: ``` ['mw1230.eqiad.wmnet'] ```
[05:48:09] <_joe_>	 this thing really doesn't work
[05:48:44] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881#4274470 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` mw1230.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/20180612054...
[05:48:46] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881#4274471 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1230.eqiad.wmnet'] ```  Of which those **FAILED**: ``` ['mw1230.eqiad.wmnet'] ```
[05:49:39] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881#4274472 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` mw1230.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/20180612054...
[05:50:18] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881#4274473 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1230.eqiad.wmnet'] ```  Of which those **FAILED**: ``` ['mw1230.eqiad.wmnet'] ```
[05:50:42] <_joe_>	 ok this is really really frustrating, I'll reimage that host by hand
[05:51:12] <_joe_>	 (╯°□°）╯︵ ┻━┻
[05:54:33] <elukey>	 what errors do you get??
[05:58:48] <icinga-wm>	 PROBLEM - mcrouter process on mw2235 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (mcrouter), command name mcrouter
[06:03:03] <_joe_>	 this is me ^^
[06:03:10] <_joe_>	 I'm doing some further tests
[06:03:30] <_joe_>	 elukey: whatever, I don
[06:03:40] <_joe_>	 't have time for broken processes and broken docs
[06:10:08] <icinga-wm>	 RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time
[06:16:18] <icinga-wm>	 RECOVERY - mcrouter process on mw2235 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter
[06:25:18] <icinga-wm>	 RECOVERY - configured eth on mw1230 is OK: OK - interfaces up
[06:25:19] <icinga-wm>	 RECOVERY - dhclient process on mw1230 is OK: PROCS OK: 0 processes with command name dhclient
[06:25:28] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439837
[06:25:28] <icinga-wm>	 RECOVERY - MD RAID on mw1230 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[06:25:39] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1230 is OK: OK - load average: 6.56, 4.28, 2.28
[06:25:39] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1230 is OK: OK ferm input default policy is set
[06:25:58] <icinga-wm>	 RECOVERY - Disk space on mw1230 is OK: DISK OK
[06:25:59] <icinga-wm>	 RECOVERY - HHVM processes on mw1230 is OK: PROCS OK: 6 processes with command name hhvm
[06:26:09] <icinga-wm>	 RECOVERY - mcrouter process on mw1230 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter
[06:26:09] <icinga-wm>	 RECOVERY - DPKG on mw1230 is OK: All packages OK
[06:26:18] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw1230 is OK: OK: nf_conntrack is 0 % full
[06:27:19] <icinga-wm>	 RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 76410 bytes in 8.063 second response time
[06:27:28] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.424 second response time
[06:27:29] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439837 (owner: 10Marostegui)
[06:29:01] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439837 (owner: 10Marostegui)
[06:29:14] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439837 (owner: 10Marostegui)
[06:30:07] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1091 after alter table (duration: 00m 51s)
[06:30:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:20] <wikibugs>	 10Operations, 10ops-eqiad, 10Datasets-General-or-Unknown, 10Patch-For-Review, 10User-ArielGlenn: rack/setup/install snapshot1009 - https://phabricator.wikimedia.org/T196189#4274480 (10ArielGlenn)
[06:30:49] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439838 (https://phabricator.wikimedia.org/T191316)
[06:31:29] <icinga-wm>	 PROBLEM - puppet last run on labvirt1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/nova/policy.json]
[06:31:40] <marostegui>	 !log Stop replication on db1095, db1102, db1125 to change triggers - T192926
[06:31:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:45] <stashbot>	 T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[06:34:56] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439838 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui)
[06:36:27] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439838 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui)
[06:37:36] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1121 for alter table (duration: 00m 50s)
[06:37:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:56] <marostegui>	 !log Deploy schema change on db1121 with replication, this will generate lag on labsdb:s4 T191316 T192926 T89737 T195193
[06:38:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:02] <stashbot>	 T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[06:38:03] <stashbot>	 T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[06:38:03] <stashbot>	 T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[06:38:03] <stashbot>	 T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[06:38:32] <wikibugs>	 (03PS2) 10Dzahn: Remove /xhprof from performance.wikimedia.org apache config [puppet] - 10https://gerrit.wikimedia.org/r/439647 (https://phabricator.wikimedia.org/T196406) (owner: 10Imarlier)
[06:38:54] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439838 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui)
[06:38:59] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on mw1230 is OK: OK: synced at Tue 2018-06-12 06:38:53 UTC.
[06:49:09] <wikibugs>	 (03CR) 10Dzahn: [C: 032] Remove /xhprof from performance.wikimedia.org apache config [puppet] - 10https://gerrit.wikimedia.org/r/439647 (https://phabricator.wikimedia.org/T196406) (owner: 10Imarlier)
[06:49:20] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[06:52:39] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:53:30] <icinga-wm>	 RECOVERY - Long running screen/tmux on mw1230 is OK: OK: No SCREEN or tmux processes detected.
[06:54:10] <icinga-wm>	 RECOVERY - IPMI Sensor Status on mw1230 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[06:55:39] <icinga-wm>	 RECOVERY - puppet last run on mw1230 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:49] <icinga-wm>	 RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:07:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] "Two nits, looks good to me!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/439641 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[07:11:25] <icinga-wm>	 PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received
[07:11:34] <wikibugs>	 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 2 others: Deploying FileExporter and FileImporter - https://phabricator.wikimedia.org/T190716#4274554 (10Lea_WMDE)
[07:11:42] <wikibugs>	 (03CR) 10Jcrespo: [C: 031] "This looks good to me, let me know when to merge it and how to test it to validate it." [puppet] - 10https://gerrit.wikimedia.org/r/439581 (https://phabricator.wikimedia.org/T196604) (owner: 10Aklapper)
[07:12:25] <icinga-wm>	 RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy
[07:12:34] <wikibugs>	 (03CR) 10Dzahn: [C: 032] phabricator weekly project changes email: Add mysql slave port parameter [puppet] - 10https://gerrit.wikimedia.org/r/439581 (https://phabricator.wikimedia.org/T196604) (owner: 10Aklapper)
[07:12:39] <wikibugs>	 (03PS2) 10Dzahn: phabricator weekly project changes email: Add mysql slave port parameter [puppet] - 10https://gerrit.wikimedia.org/r/439581 (https://phabricator.wikimedia.org/T196604) (owner: 10Aklapper)
[07:12:57] <wikibugs>	 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 2 others: Deploying FileExporter and FileImporter - https://phabricator.wikimedia.org/T190716#4081970 (10Lea_WMDE)
[07:13:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] "One nit, looks good to me" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/439580 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[07:15:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] debmonitor: client side setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/439580 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[07:17:51] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "tested by running  /usr/local/bin/project_changes.sh  on phab1001.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/439581 (https://phabricator.wikimedia.org/T196604) (owner: 10Aklapper)
[07:17:55] <icinga-wm>	 PROBLEM - Host cp3037 is DOWN: PING CRITICAL - Packet loss = 100%
[07:24:15] <icinga-wm>	 PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:24:16] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-2] "I don't think this is the approach we should take if we want to make all those files templates. I even tried going this way in the past an" [puppet] - 10https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk)
[07:24:25] <icinga-wm>	 PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 65 not-conn: cp3037_v6
[07:24:26] <icinga-wm>	 PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:24:35] <icinga-wm>	 PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3037_v4, cp3037_v6
[07:24:35] <icinga-wm>	 PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:24:36] <icinga-wm>	 PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:24:36] <icinga-wm>	 PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:24:36] <icinga-wm>	 PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6
[07:24:46] <icinga-wm>	 PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 65 not-conn: cp3037_v6
[07:24:46] <icinga-wm>	 PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3037_v4, cp3037_v6
[07:24:55] <icinga-wm>	 PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6
[07:24:55] <icinga-wm>	 PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6
[07:25:05] <icinga-wm>	 PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:25:05] <icinga-wm>	 PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6
[07:25:05] <icinga-wm>	 PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:25:05] <icinga-wm>	 PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3037_v4, cp3037_v6
[07:25:05] <icinga-wm>	 PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3037_v4, cp3037_v6
[07:25:15] <icinga-wm>	 PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:25:15] <icinga-wm>	 PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3037_v4, cp3037_v6
[07:25:15] <icinga-wm>	 PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:25:15] <icinga-wm>	 PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3037_v4, cp3037_v6
[07:25:25] <icinga-wm>	 PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6
[07:25:25] <icinga-wm>	 PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3037_v4, cp3037_v6
[07:25:25] <icinga-wm>	 PROBLEM - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6
[07:25:26] <icinga-wm>	 PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 65 not-conn: cp3037_v6
[07:25:35] <icinga-wm>	 PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 65 not-conn: cp3037_v6
[07:25:45] <icinga-wm>	 PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:26:21] <godog>	 I'll take a look at cp3037
[07:26:36] <vgutierrez>	 I cannot reach it over the network nor over the management interface
[07:27:20] <godog>	 hah! might be dead in the water
[07:28:44] <godog>	 vgutierrez: I take it you'll keep on looking/followup? to avoid both working on it
[07:28:59] <vgutierrez>	 I'm trying to :)
[07:30:26] <wikibugs>	 10Operations, 10Wikimedia-Apache-configuration: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968#4274566 (10Joe)
[07:30:59] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-2] "I created a task about my plans here https://phabricator.wikimedia.org/T196968" [puppet] - 10https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk)
[07:31:16] <mutante>	 !log closing idle screen session on tin (about to be decomed, dont use anymore)
[07:31:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:38] <volans>	 vgutierrez: interesting enough cp3037 doesn't have an icinga check for the mgmt interface...
[07:32:06] <vgutierrez>	 volans: also on librenms is showing traffic in real time...
[07:32:13] <vgutierrez>	 https://librenms.wikimedia.org/device/device=138/tab=port/port=10861/view=realtime/
[07:32:13] <volans>	 all the sourrounding cp30* have it ofc
[07:32:32] <vgutierrez>	 that... or the port is mislabeled :/
[07:34:41] <volans>	 so, the check was removed in the last run of puppet on icinga
[07:34:59] <volans>	 I'll check that part
[07:36:01] <wikibugs>	 (03PS6) 10Volans: debmonitor: client side setup [puppet] - 10https://gerrit.wikimedia.org/r/439580 (https://phabricator.wikimedia.org/T191300)
[07:36:03] <wikibugs>	 (03PS6) 10Volans: debmonitor: install debmonitor-client [puppet] - 10https://gerrit.wikimedia.org/r/439641 (https://phabricator.wikimedia.org/T191300)
[07:36:17] <wikibugs>	 (03CR) 10Volans: "Thanks for the review, replies inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/439580 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[07:36:31] <wikibugs>	 (03CR) 10Volans: "Thanks for the review, replies inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/439641 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[07:37:17] <wikibugs>	 (03CR) 10Volans: "The full compiler (with PS5) is available at:" [puppet] - 10https://gerrit.wikimedia.org/r/439641 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[07:38:41] <moritzm>	 ipmitool "chassis status" is also failing for cp3037
[07:39:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] debmonitor: client side setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/439580 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[07:39:38] <vgutierrez>	 right... xe-3/0/04 labeled as cp3037 is actually cp3036
[07:39:50] <mutante>	 how weird that the check for mgmt is gone
[07:39:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] debmonitor: client side setup [puppet] - 10https://gerrit.wikimedia.org/r/439580 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[07:39:56] <vgutierrez>	 *xe-3/0/4 sorry
[07:40:00] <volans>	 yeah, I'm trying to understand why
[07:40:49] <mutante>	 i can confirm it and there seems nothing that explains it in site.pp.. same role as others and that gets added deep in base.pp
[07:41:03] <volans>	 it's not in the last compiled catalog of that host
[07:41:15] <akosiaris>	 !log ganeti2003 reboot for microcode update
[07:41:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:24] <volans>	 so seems that the resourse was not exported by puppet,, but why?
[07:42:17] <moritzm>	 a serial console session on cp3036.mgmt actually connects to cp3036 at least#
[07:43:03] <volans>	 it might be explained by:
[07:43:04] <volans>	 $facts['has_ipmi'] and $facts['ipmi_lan'] and 'ipaddress' in $facts['ipmi_lan']
[07:43:05] <icinga-wm>	 PROBLEM - Host ganeti2003 is DOWN: PING CRITICAL - Packet loss = 100%
[07:43:12] <volans>	 the mgmt checks are if guarded by the above
[07:43:35] <mutante>	 oh, that seems like it can explain it.. when DRAC breaks?
[07:43:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/439641 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[07:43:49] <mutante>	 heh, does that mean it's like "only if DRAC works, then check it" 
[07:43:52] <akosiaris>	 !log ganeti2007 reboot for microcode update
[07:43:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:23] <volans>	 yeah, ipmi_lan fact is missing
[07:44:31] <mutante>	 i feel this will end in reseating DRAC and then it's back :p
[07:44:46] <akosiaris>	 isn't it embedded ? 
[07:44:54] <akosiaris>	 it's not a dedicated card, is it ?
[07:44:55] <icinga-wm>	 RECOVERY - Host ganeti2003 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms
[07:45:01] <mutante>	 then replacin the board
[07:45:05] <akosiaris>	 lol
[07:45:15] <akosiaris>	 well, drain flea power first I guess
[07:45:22] <vgutierrez>	 basically cp3036 and cp3037 labels are switched on asw-esams.. so xe-3/0/5 is cp3037 and the port is reported as physically up but no traffic there ofc
[07:46:05] <icinga-wm>	 PROBLEM - Host ganeti2007 is DOWN: PING CRITICAL - Packet loss = 100%
[07:46:13] <_joe_>	 uh?
[07:46:17] <_joe_>	 oh ok
[07:46:29] <akosiaris>	 nothing to see here, move along
[07:46:47] <_joe_>	 yeah I read the DOWN and then read backscroll
[07:46:47] <mutante>	 maybe that if-guard should have an else-branch that says "WARN - no IPMI IP" 
[07:46:56] <volans>	 so yeah, for me 'bmc-config -o -S Lan_Conf' failed / didn't return valid data, and the fact was not populated, hence the resourse was not exported
[07:47:33] <volans>	 mutante: yeah, but we need then to if guard everything with an is_virtual
[07:47:35] <icinga-wm>	 RECOVERY - Host ganeti2007 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms
[07:47:58] <volans>	 also, puppet defines a check, not it's return value ;)
[07:48:01] <volans>	 it's a bit tricky
[07:48:06] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439842
[07:48:31] <mutante>	 yep, node
[07:48:34] <mutante>	 *nod*
[07:48:37] <volans>	 vgutierrez: fwiw at the end of april its remote ipmi was working (I've done an audit)
[07:50:01] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439842 (owner: 10Marostegui)
[07:51:12] <vgutierrez>	 from bast3002, at least the mgmt interface is reachable aka 3-way handshake but I'm not able to get a proper ssh session there
[07:51:45] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439842 (owner: 10Marostegui)
[07:52:29] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439842 (owner: 10Marostegui)
[07:53:25] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1121 after alter table (duration: 00m 50s)
[07:53:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:48] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1034 - https://phabricator.wikimedia.org/T195569#4274603 (10fgiunchedi) Yeah I think it might have been the controller barfing and the disk is actually ok.   I couldn't find related logs on lithium tho so hard to know for sure. The disk can be sent back, we'll o...
[08:03:08] <marostegui>	 !log Deploy schema change on s1 codfw primary master (db2048) with replication, this will generate lag on codfw T191316 T192926 T89737 T195193
[08:03:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:14] <stashbot>	 T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[08:03:15] <stashbot>	 T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[08:03:15] <stashbot>	 T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[08:03:15] <stashbot>	 T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[08:04:29] <akosiaris>	 !log ganeti2006 reboot for microcode update
[08:04:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:25] <icinga-wm>	 PROBLEM - Host ganeti2006 is DOWN: PING CRITICAL - Packet loss = 100%
[08:07:25] <icinga-wm>	 PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:07:26] <icinga-wm>	 PROBLEM - etcd request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:08:15] <icinga-wm>	 RECOVERY - Host ganeti2006 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms
[08:08:52] <wikibugs>	 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 2 others: Deploying FileExporter and FileImporter - https://phabricator.wikimedia.org/T190716#4274635 (10Lea_WMDE)
[08:09:41] <wikibugs>	 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 2 others: Deploying FileExporter and FileImporter - https://phabricator.wikimedia.org/T190716#4081970 (10Lea_WMDE)
[08:09:45] <icinga-wm>	 RECOVERY - etcd request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:10:13] <wikibugs>	 (03CR) 10Volans: "Compiler is still happy: https://puppet-compiler.wmflabs.org/compiler02/11449/" [puppet] - 10https://gerrit.wikimedia.org/r/439580 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[08:10:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, see nit inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/432528 (https://phabricator.wikimedia.org/T182085) (owner: 1020after4)
[08:10:45] <icinga-wm>	 RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:15:19] <akosiaris>	 !log ganeti2002 reboot for microcode update
[08:15:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:45] <icinga-wm>	 PROBLEM - Host ganeti2002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:15:56] <wikibugs>	 10Operations, 10WMDE-QWERTY-Team, 10wikidiff2, 10Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4274651 (10Lea_WMDE)
[08:18:05] <icinga-wm>	 RECOVERY - Host ganeti2002 is UP: PING OK - Packet loss = 0%, RTA = 36.46 ms
[08:18:47] <icinga-wm>	 PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.140 second response time
[08:21:46] <wikibugs>	 10Operations, 10ops-esams, 10netops: cp3036 and cp3037 production ports mislabeled - https://phabricator.wikimedia.org/T196970#4274656 (10Vgutierrez)
[08:22:30] <wikibugs>	 (03PS2) 10Dvorapa: toollabs: install python{,3}-pymysql on exec_environ [puppet] - 10https://gerrit.wikimedia.org/r/416874 (https://phabricator.wikimedia.org/T189052) (owner: 10Zhuyifei1999)
[08:23:50] <akosiaris>	 fwiw most tools workers seem fine at a kubectl get nodes -o wide 
[08:24:04] <akosiaris>	 checker is probably having a hiccup
[08:25:04] <wikibugs>	 (03CR) 10Alex Monk: "So if I understand correctly what you're saying is that having puppet generate files this size through templates slows it to a crawl and t" [puppet] - 10https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk)
[08:25:32] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Promote db1066 to master [puppet] - 10https://gerrit.wikimedia.org/r/439530 (https://phabricator.wikimedia.org/T194870)
[08:25:40] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad.php: Set s2 as read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439531 (https://phabricator.wikimedia.org/T194870)
[08:25:47] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad.php: Promote db1066 to master and remove read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439532 (https://phabricator.wikimedia.org/T194870)
[08:25:54] <wikibugs>	 (03PS2) 10Marostegui: wmnet: Update s2-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/439533 (https://phabricator.wikimedia.org/T194870)
[08:26:03] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: kubernetes: Remove deprecated --api-servers parameter [puppet] - 10https://gerrit.wikimedia.org/r/436483
[08:26:22] <wikibugs>	 (03CR) 10Volans: [C: 032] debmonitor: client side setup [puppet] - 10https://gerrit.wikimedia.org/r/439580 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[08:27:11] <arturo>	 yeah
[08:27:35] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: systemd: add define specific to timers [puppet] - 10https://gerrit.wikimedia.org/r/417948
[08:28:11] <arturo>	 not sure how checker generates that info
[08:29:29] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-2] "> So if I understand correctly what you're saying is that having" [puppet] - 10https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk)
[08:30:24] <wikibugs>	 (03PS1) 10Volans: debmonitor: fix directory creation [puppet] - 10https://gerrit.wikimedia.org/r/439849
[08:30:25] <icinga-wm>	 PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:30:45] <icinga-wm>	 PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:30:59] <volans>	 it's me.... sorry
[08:31:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] debmonitor: fix directory creation [puppet] - 10https://gerrit.wikimedia.org/r/439849 (owner: 10Volans)
[08:31:05] <icinga-wm>	 PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:14] <wikibugs>	 (03CR) 10Volans: [C: 032] debmonitor: fix directory creation [puppet] - 10https://gerrit.wikimedia.org/r/439849 (owner: 10Volans)
[08:31:15] <icinga-wm>	 PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:16] <icinga-wm>	 PROBLEM - puppet last run on elastic2035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:25] <icinga-wm>	 PROBLEM - puppet last run on wtp2012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:25] <icinga-wm>	 PROBLEM - puppet last run on wtp2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:35] <icinga-wm>	 PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:36] <icinga-wm>	 PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:45] <icinga-wm>	 PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:55] <icinga-wm>	 PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:55] <icinga-wm>	 PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:55] <volans>	 fixing
[08:31:56] <icinga-wm>	 PROBLEM - puppet last run on ms-be1028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:32:05] <icinga-wm>	 PROBLEM - puppet last run on elastic2030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:32:05] <icinga-wm>	 PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:32:06] <icinga-wm>	 PROBLEM - puppet last run on kubestagetcd1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:32:06] <icinga-wm>	 PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:32:06] <icinga-wm>	 PROBLEM - puppet last run on wtp2007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:32:10] <moritzm>	 I'm shutting up icinga-wm
[08:32:16] <icinga-wm>	 PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:32:17] <volans>	 thx
[08:32:35] <akosiaris>	 volans: should I take the opportunity to also reboot puppetdb for the spec-ctrl thing ?
[08:32:42] <volans>	 go for it!
[08:32:43] <volans>	 :D
[08:33:36] <akosiaris>	 !log reboot puppetdb1001 for spec-ctrl enable. Bundling it with a minor puppet outage to only have a torrent of harmless puppet failures once
[08:33:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:51] <volans>	 akosiaris: let me know when done, so I do only one run of cumin run puppet on failed ones
[08:35:03] <akosiaris>	 arturo: btw some worker nodes are cordoned. I guess you are aware, just mentioning
[08:35:06] <logmsgbot>	 !log ema@neodymium conftool action : set/pooled=no; selector: name=cp3037.esams.wmnet
[08:35:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:24] <akosiaris>	 volans: up and running
[08:35:36] <volans>	 akosiaris: ack, thanks
[08:35:52] <arturo>	 akosiaris: actually I don't know why
[08:35:58] <akosiaris>	 !log rebalance ganeti codfw cluster
[08:36:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:06] <volans>	 !log running puppet on failed hosts post small puppet outage and puppetdb reboot
[08:36:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:12] <akosiaris>	 and now the hopefully final round of VM reboots
[08:39:33] <wikibugs>	 (03PS1) 10Volans: debmonitor: fix newlines in conf file [puppet] - 10https://gerrit.wikimedia.org/r/439852
[08:41:52] <wikibugs>	 (03CR) 10Volans: [C: 032] "Now it's correct:" [puppet] - 10https://gerrit.wikimedia.org/r/439852 (owner: 10Volans)
[08:43:11] <akosiaris>	 arturo: I am gonna merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/436483/. It should be noop as I 've already tested it 
[08:43:29] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: kubernetes: Remove deprecated --api-servers parameter [puppet] - 10https://gerrit.wikimedia.org/r/436483
[08:43:33] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes: Remove deprecated --api-servers parameter [puppet] - 10https://gerrit.wikimedia.org/r/436483 (owner: 10Alexandros Kosiaris)
[08:46:16] <arturo>	 akosiaris: ok
[08:48:02] <marostegui>	 !log Stop replication on db2094 to change triggers for archive table
[08:48:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:48] <volans>	 arturo: I've noticed (from a puppetcompiler failure) that in the labs/privare repo the wmcs/monitoring/wmcs_monitoring_rsync key is missing. I was about to add it as snakeoil, but double checking with you in case a real one is needed there
[08:53:29] <volans>	 *labs/private
[08:54:33] <arturo>	 volans: yeah, probably just the actual private exists, and not in labs/private
[08:57:25] <volans>	 yes, it's in the real private one and not in the 'public' private
[08:58:16] <icinga-wm>	 RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:58:25] <icinga-wm>	 RECOVERY - puppet last run on restbase-dev1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[08:58:25] <icinga-wm>	 RECOVERY - puppet last run on mw2253 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:58:26] <icinga-wm>	 RECOVERY - puppet last run on wtp1042 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[08:58:46] <icinga-wm>	 RECOVERY - puppet last run on db1053 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:59:05] <icinga-wm>	 RECOVERY - puppet last run on mc1021 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[08:59:14] <volans>	 the puppet run has completed, all good
[08:59:46] <icinga-wm>	 RECOVERY - puppet last run on ms-be1043 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[09:00:05] <icinga-wm>	 RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:00:26] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[09:00:36] <volans>	 akosiaris: FYI puppet fails on kubernetes2003.codfw.wmnet
[09:00:48] <volans>	 Systemd start for docker failed!
[09:01:14] <akosiaris>	 volans: yeah ignore it
[09:01:22] <volans>	 ack
[09:01:22] <akosiaris>	 I am still fighting with the imaging process
[09:03:46] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:04:05] <wikibugs>	 (03PS1) 10Volans: Add missing wmcs/monitoring dummy keys [labs/private] - 10https://gerrit.wikimedia.org/r/439856
[09:04:08] <wikibugs>	 (03PS6) 10Alexandros Kosiaris: Add the nodes for the proton service [puppet] - 10https://gerrit.wikimedia.org/r/437995 (https://phabricator.wikimedia.org/T186748)
[09:04:10] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add the nodes for the proton service [puppet] - 10https://gerrit.wikimedia.org/r/437995 (https://phabricator.wikimedia.org/T186748) (owner: 10Alexandros Kosiaris)
[09:04:20] <volans>	 arturo: ^^^
[09:04:58] <arturo>	 volans: but that kubernetes has nothing to do with toolforge, right?
[09:05:05] <volans>	 my patch
[09:05:21] <arturo>	 oh, I ingore `wikibugs` :-P
[09:05:24] <arturo>	 ignore*
[09:05:30] <volans>	 ahhhh
[09:05:31] <volans>	 :D
[09:05:36] <volans>	 https://gerrit.wikimedia.org/r/439856
[09:06:07] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] Add missing wmcs/monitoring dummy keys [labs/private] - 10https://gerrit.wikimedia.org/r/439856 (owner: 10Volans)
[09:06:14] <arturo>	 volans: +2
[09:06:25] <wikibugs>	 (03CR) 10Volans: [V: 032] Add missing wmcs/monitoring dummy keys [labs/private] - 10https://gerrit.wikimedia.org/r/439856 (owner: 10Volans)
[09:06:31] <volans>	 ack done :)
[09:07:16] <arturo>	 thanks volans !
[09:07:26] <volans>	 yw
[09:07:54] <wikibugs>	 (03PS7) 10Volans: debmonitor: install debmonitor-client [puppet] - 10https://gerrit.wikimedia.org/r/439641 (https://phabricator.wikimedia.org/T191300)
[09:08:15] <wikibugs>	 (03PS10) 10Alexandros Kosiaris: mathoid: Use the kubernetes LVS cluster explictly [puppet] - 10https://gerrit.wikimedia.org/r/437254
[09:08:17] <wikibugs>	 (03PS6) 10Alexandros Kosiaris: lvs: Add the proton lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/437997 (https://phabricator.wikimedia.org/T186748)
[09:08:30] <wikibugs>	 (03PS2) 10Gehel: elasticsearch: enable G1 garbage collector [puppet] - 10https://gerrit.wikimedia.org/r/437231 (https://phabricator.wikimedia.org/T156137)
[09:10:51] <wikibugs>	 (03CR) 10DCausse: [C: 031] elasticsearch: enable G1 garbage collector [puppet] - 10https://gerrit.wikimedia.org/r/437231 (https://phabricator.wikimedia.org/T156137) (owner: 10Gehel)
[09:12:11] <wikibugs>	 (03CR) 10Gehel: [C: 032] elasticsearch: enable G1 garbage collector [puppet] - 10https://gerrit.wikimedia.org/r/437231 (https://phabricator.wikimedia.org/T156137) (owner: 10Gehel)
[09:12:52] <wikibugs>	 (03PS11) 10Alexandros Kosiaris: mathoid: Use the kubernetes LVS cluster explictly [puppet] - 10https://gerrit.wikimedia.org/r/437254
[09:12:54] <wikibugs>	 (03PS7) 10Alexandros Kosiaris: lvs: Add the proton lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/437997 (https://phabricator.wikimedia.org/T186748)
[09:12:56] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: conftool: Add the mathoid service to kubernetes cluster [puppet] - 10https://gerrit.wikimedia.org/r/439857
[09:14:37] <wikibugs>	 (03PS8) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298)
[09:15:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff)
[09:16:42] <arturo>	 the icinga-wm bot left the #wikimedia-cloud-feed channel, how can I tell it to rejoin?
[09:17:42] <wikibugs>	 (03CR) 10Volans: "Latest compiler results: https://puppet-compiler.wmflabs.org/compiler02/11452/" [puppet] - 10https://gerrit.wikimedia.org/r/439641 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[09:18:19] <volans>	 arturo: mmmh checking, it was restarted by puppet, so should have re-joined all channels
[09:18:52] <arturo>	 volans: oh sorry it actually rejoined. irccloud wasn't clear about that :-P
[09:19:02] <volans>	 ah ok, that makes sense
[09:19:38] <arturo>	 I didn't get the recovery message from the toolforge k8s thing
[09:19:44] <arturo>	 because the bot left
[09:19:46] <arturo>	 but is now ok
[09:19:54] <wikibugs>	 (03PS9) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298)
[09:21:00] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff)
[09:29:45] <icinga-wm>	 RECOVERY - DPKG on multatuli is OK: All packages OK
[09:30:32] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Introduce replication password on mariadb root clients [puppet] - 10https://gerrit.wikimedia.org/r/439860
[09:30:46] <icinga-wm>	 PROBLEM - ircecho bot process on kraz is CRITICAL: PROCS CRITICAL: 0 processes with command name python, regex args /usr/local/bin/udpmxircecho.py
[09:31:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Introduce replication password on mariadb root clients [puppet] - 10https://gerrit.wikimedia.org/r/439860 (owner: 10Jcrespo)
[09:31:48] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Introduce replication password on mariadb root clients [puppet] - 10https://gerrit.wikimedia.org/r/439860
[09:31:55] <icinga-wm>	 RECOVERY - ircecho bot process on kraz is OK: PROCS OK: 1 process with command name python, regex args /usr/local/bin/udpmxircecho.py
[09:32:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Introduce replication password on mariadb root clients [puppet] - 10https://gerrit.wikimedia.org/r/439860 (owner: 10Jcrespo)
[09:32:46] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Introduce replication password on mariadb root clients [puppet] - 10https://gerrit.wikimedia.org/r/439860
[09:33:16] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: cp3037 is currently unreachable - https://phabricator.wikimedia.org/T196974#4274802 (10Vgutierrez)
[09:33:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Introduce replication password on mariadb root clients [puppet] - 10https://gerrit.wikimedia.org/r/439860 (owner: 10Jcrespo)
[09:34:06] <wikibugs>	 (03PS4) 10Jcrespo: mariadb: Introduce replication password on mariadb root clients [puppet] - 10https://gerrit.wikimedia.org/r/439860
[09:34:41] <wikibugs>	 (03PS5) 10Jcrespo: mariadb: Introduce replication password on mariadb root clients [puppet] - 10https://gerrit.wikimedia.org/r/439860
[09:35:23] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: cp3037 is currently unreachable - https://phabricator.wikimedia.org/T196974#4274802 (10Vgutierrez) p:05Triage>03Normal
[09:36:25] <icinga-wm>	 ACKNOWLEDGEMENT - Host cp3037 is DOWN: PING CRITICAL - Packet loss = 100% Vgutierrez T196974
[09:37:29] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Introduce replication password on mariadb root clients [puppet] - 10https://gerrit.wikimedia.org/r/439860 (owner: 10Jcrespo)
[09:41:55] <vgutierrez>	 !log cp3037 has been depooled due to unknown hardware issues T196974
[09:42:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:01] <stashbot>	 T196974: cp3037 is currently unreachable - https://phabricator.wikimedia.org/T196974
[09:44:36] <icinga-wm>	 RECOVERY - nutcracker process on mw1230 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[09:44:45] <icinga-wm>	 RECOVERY - nutcracker port on mw1230 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[09:44:48] <wikibugs>	 (03PS1) 10Volans: debmonitor: fine tune nginx fail_timeout [puppet] - 10https://gerrit.wikimedia.org/r/439865 (https://phabricator.wikimedia.org/T191299)
[09:44:56] <icinga-wm>	 RECOVERY - Check systemd state on mw1230 is OK: OK - running: The system is fully operational
[09:46:12] <wikibugs>	 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4274837 (10Lea_WMDE)
[09:50:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] "Looks fine, we can do a real world test when debmonitor us run the first time on trusty." [puppet] - 10https://gerrit.wikimedia.org/r/439865 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans)
[09:50:55] <wikibugs>	 (03CR) 10Volans: "Two nit/questions inline" (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff)
[09:51:19] <wikibugs>	 (03CR) 10Volans: [C: 032] debmonitor: fine tune nginx fail_timeout [puppet] - 10https://gerrit.wikimedia.org/r/439865 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans)
[10:01:36] <wikibugs>	 (03PS10) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298)
[10:02:35] <icinga-wm>	 PROBLEM - Host mwdebug2002 is DOWN: PING CRITICAL - Packet loss = 100%
[10:02:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff)
[10:03:15] <icinga-wm>	 RECOVERY - Host mwdebug2002 is UP: PING OK - Packet loss = 0%, RTA = 36.32 ms
[10:05:20] <icinga-wm>	 ACKNOWLEDGEMENT - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6 Ema cp3037 down T196974
[10:05:20] <icinga-wm>	 ACKNOWLEDGEMENT - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6 Ema cp3037 down T196974
[10:05:20] <icinga-wm>	 ACKNOWLEDGEMENT - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6 Ema cp3037 down T196974
[10:05:20] <icinga-wm>	 ACKNOWLEDGEMENT - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6 Ema cp3037 down T196974
[10:05:20] <icinga-wm>	 ACKNOWLEDGEMENT - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6 Ema cp3037 down T196974
[10:05:20] <icinga-wm>	 ACKNOWLEDGEMENT - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6 Ema cp3037 down T196974
[10:06:32] <wikibugs>	 10Operations, 10LDAP: Update certificates on productions replicas of corp.wikimedia.org LDAP - https://phabricator.wikimedia.org/T168460#4274898 (10Aklapper) a:05bbogaert>03None
[10:14:25] <icinga-wm>	 PROBLEM - DPKG on multatuli is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[10:16:55] <icinga-wm>	 PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received
[10:17:09] <volans>	 multatuli it mor.itz and me playing with debmonitor
[10:18:49] <wikibugs>	 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4274935 (10Lea_WMDE)
[10:19:06] <icinga-wm>	 RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy
[10:19:32] <wikibugs>	 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4224988 (10Lea_WMDE)
[10:20:55] <wikibugs>	 (03PS1) 10Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871
[10:21:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 (owner: 10Jcrespo)
[10:21:25] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[10:21:34] <godog>	 !log bounce stuck rsyslog on lithium / wezen - T136312
[10:21:36] <godog>	 that's me ^
[10:21:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:39] <stashbot>	 T136312: Encrypt syslog traffic - https://phabricator.wikimedia.org/T136312
[10:21:45] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 1229 days)
[10:21:51] <wikibugs>	 (03PS3) 10Dvorapa: toollabs: install python{,3}-pymysql on exec_environ [puppet] - 10https://gerrit.wikimedia.org/r/416874 (https://phabricator.wikimedia.org/T189052) (owner: 10Zhuyifei1999)
[10:24:55] <wikibugs>	 10Operations, 10JADE, 10Scoring-platform-team, 10User-Joe: Scalability concerns creating a page per revision - https://phabricator.wikimedia.org/T196547#4274967 (10awight) Wikidata wouldn't survive a year of this upper-bound unscalability.  It has received 200M edits in the past 12 months, so we would have...
[10:25:56] <wikibugs>	 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4224988 (10WMDE-Fisch)
[10:26:25] <jynus>	 !log setting expire_log_days on db1066 as 30
[10:26:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:55] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp3046 is CRITICAL: CRITICAL: expiry mailbox lag is 2014133
[10:29:05] <icinga-wm>	 PROBLEM - puppet last run on cp5008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:29:05] <icinga-wm>	 PROBLEM - puppet last run on mw2183 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:29:15] <icinga-wm>	 PROBLEM - puppet last run on mw2248 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:29:26] <icinga-wm>	 PROBLEM - puppet last run on labtestvirt2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:30:16] <icinga-wm>	 PROBLEM - puppet last run on db2075 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:30:45] <icinga-wm>	 PROBLEM - puppet last run on db2050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:30:45] <icinga-wm>	 PROBLEM - puppet last run on dbstore2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:31:12] <godog>	 taking a look, puppetdb perhaps
[10:31:58] <godog>	 indeed
[10:32:06] <icinga-wm>	 PROBLEM - puppet last run on maps2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:32:09] <godog>	  Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed to execute '/pdb/cmd/v1?checksum=5457825afc630dada2b6fbdbd3395d5b61c3ff12&version=5&certname=dbstore2001.codfw.wmnet&command=replace_facts&producer-timestamp=1528799187' on at least 1 of the following 'server_urls': https://puppetdb2001.codfw.wmnet
[10:32:25] <icinga-wm>	 PROBLEM - puppet last run on mc2031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:32:26] <icinga-wm>	 PROBLEM - puppet last run on mw2195 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:32:36] <icinga-wm>	 PROBLEM - puppet last run on cp4021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:32:45] <icinga-wm>	 PROBLEM - puppet last run on elastic2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:33:05] <icinga-wm>	 PROBLEM - puppet last run on elastic2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:33:14] <godog>	 should be recovering
[10:33:15] <icinga-wm>	 PROBLEM - puppet last run on mw2147 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:33:35] <icinga-wm>	 PROBLEM - puppet last run on wtp2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:33:35] <icinga-wm>	 PROBLEM - puppet last run on ms-be2032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:33:45] <icinga-wm>	 PROBLEM - puppet last run on mc2035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:35:21] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1056 - https://phabricator.wikimedia.org/T193736#4275004 (10Marostegui)
[10:35:23] <volans>	 probably the ganeti restarts
[10:35:37] <volans>	 akosiaris: was puppetdb2001 also in the loop for restarts?
[10:39:42] <moritzm>	 it was in the list of hosts needing a reboot at least
[10:42:44] <volans>	 ack
[10:43:53] <wikibugs>	 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4224988 (10thiemowmde) I'm afraid I did not fully understood what "linking to test wiki" means? Should https://test.wikipe...
[10:44:25] <icinga-wm>	 RECOVERY - puppet last run on mw2183 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:45:54] <wikibugs>	 (03PS11) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298)
[10:47:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff)
[10:55:37] <wikibugs>	 (03PS1) 10WMDE-Fisch: Allow setting of export target for FileExporter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439875 (https://phabricator.wikimedia.org/T195370)
[10:55:39] <wikibugs>	 (03PS1) 10WMDE-Fisch: Enable FileExporter and FileImporter on group0 with test setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439876 (https://phabricator.wikimedia.org/T195370)
[10:57:36] <icinga-wm>	 RECOVERY - puppet last run on maps2002 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[10:57:55] <icinga-wm>	 RECOVERY - puppet last run on mc2031 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[10:58:05] <icinga-wm>	 RECOVERY - puppet last run on mw2195 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[10:58:15] <icinga-wm>	 RECOVERY - puppet last run on cp4021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:58:16] <icinga-wm>	 RECOVERY - puppet last run on elastic2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:58:35] <icinga-wm>	 RECOVERY - puppet last run on elastic2012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:58:36] <icinga-wm>	 RECOVERY - puppet last run on mw2147 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:58:56] <icinga-wm>	 RECOVERY - puppet last run on wtp2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:59:05] <icinga-wm>	 RECOVERY - puppet last run on ms-be2032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:59:15] <icinga-wm>	 RECOVERY - puppet last run on mc2035 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[10:59:36] <icinga-wm>	 RECOVERY - puppet last run on cp5008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:59:46] <icinga-wm>	 RECOVERY - puppet last run on mw2248 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:00:06] <icinga-wm>	 RECOVERY - puppet last run on labtestvirt2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:00:55] <icinga-wm>	 RECOVERY - puppet last run on db2075 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:00:58] <wikibugs>	 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4275065 (10WMDE-Fisch) >>! In T195370#4275029, @thiemowmde wrote: > I'm afraid I did not fully understood what "linking to...
[11:01:16] <icinga-wm>	 RECOVERY - puppet last run on db2050 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:01:16] <icinga-wm>	 RECOVERY - puppet last run on dbstore2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:02:35] <icinga-wm>	 PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 503 (expecting: 404)
[11:02:57] <wikibugs>	 (03PS2) 10WMDE-Fisch: Enable FileExporter and FileImporter on group0 with test setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439876 (https://phabricator.wikimedia.org/T195370)
[11:03:45] <icinga-wm>	 RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy
[11:06:45] <wikibugs>	 10Operations, 10JADE, 10Scoring-platform-team, 10User-Joe: Scalability concerns creating a page per revision - https://phabricator.wikimedia.org/T196547#4275076 (10awight) Some negatives to the per-page approach: * Slightly incompatible with ORES, which is per-revision.  For example, fetching an ORES+JADE...
[11:30:36] <wikibugs>	 10Operations, 10cloud-services-team, 10Patch-For-Review: cloud vps: disable system-wide apt pinning for OpenStack jessie hosts - https://phabricator.wikimedia.org/T196659#4275100 (10aborrero)  I tried generating an apt pinning file containing the dependencies of keystone which are present in jessie-backports...
[11:32:10] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: openstack: keystone: use install_options to install from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/439589 (https://phabricator.wikimedia.org/T196633)
[11:39:35] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp3046 is OK: OK: expiry mailbox lag is 131634
[11:40:09] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] "According to the compiler, this should be fine:" [puppet] - 10https://gerrit.wikimedia.org/r/439589 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez)
[11:46:49] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone: fix syntax for install_options [puppet] - 10https://gerrit.wikimedia.org/r/439884 (https://phabricator.wikimedia.org/T196633)
[11:47:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack: keystone: fix syntax for install_options [puppet] - 10https://gerrit.wikimedia.org/r/439884 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez)
[11:47:48] <moritzm>	 !log updated component/cassandra311 on apt.wikimedia.org to 3.11.2
[11:47:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:57] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: keystone: fix syntax for install_options [puppet] - 10https://gerrit.wikimedia.org/r/439884 (https://phabricator.wikimedia.org/T196633)
[11:50:04] <wikibugs>	 10Operations, 10Cassandra, 10User-Eevans: Add Cassandra 3.11.2 package to internal APT repository - https://phabricator.wikimedia.org/T196745#4275154 (10MoritzMuehlenhoff) 05Open>03Resolved Imported via Secure Apt (release key is signed by Eric with whom I've signed keys) and added to component/cassandra...
[11:50:38] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack: keystone: fix syntax for install_options [puppet] - 10https://gerrit.wikimedia.org/r/439884 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez)
[11:52:32] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: openstack: keystone: fix syntax for install_options [puppet] - 10https://gerrit.wikimedia.org/r/439884 (https://phabricator.wikimedia.org/T196633)
[11:53:44] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Puppet compiler is rather good:" [puppet] - 10https://gerrit.wikimedia.org/r/439884 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez)
[11:54:45] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[11:54:56] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 1229 days)
[11:57:23] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Enable FileExporter and FileImporter on group0 with test setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439876 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch)
[11:58:15] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:58:22] <wikibugs>	 10Operations, 10cloud-services-team, 10Patch-For-Review: cloud vps: disable system-wide apt pinning for OpenStack jessie hosts - https://phabricator.wikimedia.org/T196659#4275185 (10aborrero) Finally, the `-t jessie-backports` thing went really smooth.  Puppet output:  {P7248}
[11:58:35] <wikibugs>	 10Operations, 10cloud-services-team, 10Patch-For-Review: cloud vps: disable system-wide apt pinning for OpenStack jessie hosts - https://phabricator.wikimedia.org/T196659#4275186 (10aborrero) 05Open>03Resolved a:03aborrero
[11:58:45] <icinga-wm>	 PROBLEM - Check systemd state on db1068 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:58:47] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Allow setting of export target for FileExporter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439875 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch)
[11:59:57] <wikibugs>	 10Operations, 10DBA, 10Gerrit, 10Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4275191 (10Paladox) phabricator is going to parse the existing refs/changes/*/*/meta commits (no new ones will be added to the queue so this will eventually go down). According...
[12:00:26] <icinga-wm>	 RECOVERY - puppet last run on labcontrol1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[12:01:12] <wikibugs>	 10Operations, 10ops-eqiad: ms-be1036 in power off status, not responsive to power on commands - https://phabricator.wikimedia.org/T196873#4275206 (10fgiunchedi) p:05Normal>03High Thanks @Cmjohnson ! Please treat this with urgency, do you know if there's an ETA? If more than a couple of days I'll remove the...
[12:01:42] <akosiaris>	 moritzm: all VMs rebooted (once more). I think (hope actually) we are finally OK
[12:01:51] <akosiaris>	 volans: yeah it was as moritzm pointed out
[12:02:01] <volans>	 np
[12:02:03] <volans>	 thx
[12:02:32] <akosiaris>	 I might break conftool btw. I am merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/439857/
[12:02:35] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: conftool: Add the mathoid service to kubernetes cluster [puppet] - 10https://gerrit.wikimedia.org/r/439857
[12:02:37] <wikibugs>	 (03PS12) 10Alexandros Kosiaris: mathoid: Use the kubernetes LVS cluster explictly [puppet] - 10https://gerrit.wikimedia.org/r/437254
[12:02:39] <wikibugs>	 (03PS8) 10Alexandros Kosiaris: lvs: Add the proton lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/437997 (https://phabricator.wikimedia.org/T186748)
[12:03:42] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] conftool: Add the mathoid service to kubernetes cluster [puppet] - 10https://gerrit.wikimedia.org/r/439857 (owner: 10Alexandros Kosiaris)
[12:03:46] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 1229 days)
[12:05:13] <moritzm>	 akosiaris: thanks! I've just doublechecked via cumin; all ganeti instances are running an IBPB-enabled kernel
[12:05:47] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/weight=10; selector: dc=.*,service=mathoid,cluster=kubernetes,name=.*
[12:05:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:02] <akosiaris>	 yay
[12:09:18] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439887 (https://phabricator.wikimedia.org/T191316)
[12:10:51] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Encrypt syslog traffic - https://phabricator.wikimedia.org/T136312#4275228 (10fgiunchedi) Latest rsyslog release containing the fix is already packaged in Debian unstable, it'd be easier to backport that to stretch instead of jessie. Once w...
[12:11:07] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439887 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui)
[12:11:09] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439887 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui)
[12:11:28] <marostegui>	 !log Deploy schema change on dbstore1002:s1 T191316 T192926 T89737 T195193
[12:11:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:35] <stashbot>	 T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[12:11:35] <stashbot>	 T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[12:11:35] <stashbot>	 T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[12:11:35] <stashbot>	 T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[12:12:45] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439887 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui)
[12:13:02] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439887 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui)
[12:14:00] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1099:3311 for alter table (duration: 00m 52s)
[12:14:02] <marostegui>	 !log Deploy schema change on db1099:3311 T191316 T192926 T89737 T195193
[12:14:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:15] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone: also install python-routes from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/439888 (https://phabricator.wikimedia.org/T196633)
[12:17:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack: keystone: also install python-routes from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/439888 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez)
[12:18:59] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: keystone: also install python-routes from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/439888 (https://phabricator.wikimedia.org/T196633)
[12:19:53] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: keystone: also install python-routes from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/439888 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez)
[12:22:05] <icinga-wm>	 PROBLEM - mailman list info on fermium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:22:25] <wikibugs>	 (03PS1) 10Paladox: Copy wikimedia-polygerrit-style.html to static/gerrit-theme.html [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439889
[12:22:44] <wikibugs>	 (03PS2) 10Paladox: Copy wikimedia-polygerrit-style.html to static/gerrit-theme.html [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439889
[12:22:45] <icinga-wm>	 PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:23:04] <wikibugs>	 (03CR) 10Paladox: "This change is ready for review." [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439889 (owner: 10Paladox)
[12:23:05] <icinga-wm>	 RECOVERY - mailman list info on fermium is OK: HTTP OK: HTTP/1.1 200 OK - 15502 bytes in 3.212 second response time
[12:24:07] <jynus>	 I still cannot access mailman, can you?
[12:24:48] <wikibugs>	 (03PS1) 10Paladox: Copy GerritSite.css and GerritSiteHeader.html from puppet repo [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439890
[12:25:24] <wikibugs>	 (03PS2) 10Paladox: Copy GerritSite.css and GerritSiteHeader.html from puppet repo [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439890
[12:25:39] <wikibugs>	 (03CR) 10Paladox: "This change is ready for review." [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439890 (owner: 10Paladox)
[12:26:37] <jynus>	 it worked finally
[12:26:50] <jynus>	 maybe it got overloaded after starting?
[12:27:21] <jynus>	 oh, actually it wasn't restarted, so it is something else
[12:28:11] <jynus>	 spikes of load in the last 3 days
[12:28:37] <moritzm>	 lots of listinfo processes
[12:28:45] <jynus>	 I will check for a ticket and file one CC herron akosiaris
[12:29:05] <moritzm>	 i.e. /var/lib/mailman/scripts/driver listinfo
[12:29:17] <jynus>	 I thought it was a host restart
[12:29:22] <jynus>	 that is I wasn't too worried
[12:29:41] <moritzm>	 uptime is five days
[12:29:51] <jynus>	 yeah, I notice that only recently
[12:29:56] <jynus>	 *ced
[12:30:27] <jynus>	 it is again unavailable to me
[12:31:05] <icinga-wm>	 PROBLEM - mailman list info on fermium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:31:33] <moritzm>	 per prometheus there was a similar spike (also load of 120) yesterday at 5:30
[12:31:57] <jynus>	 yes, and 2 and 3 days ago
[12:32:06] <icinga-wm>	 RECOVERY - mailman list info on fermium is OK: HTTP OK: HTTP/1.1 200 OK - 15502 bytes in 5.301 second response time
[12:32:45] <jynus>	 I saw no ongoing ticket, will create one
[12:33:39] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: hieradata: openstack: eqiad1: actually use a false value for keystone daemon [puppet] - 10https://gerrit.wikimedia.org/r/439891 (https://phabricator.wikimedia.org/T196633)
[12:33:55] <icinga-wm>	 RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[12:34:06] <_joe_>	 !log repooling mw1230 after reimaging T196881
[12:34:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:11] <stashbot>	 T196881: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881
[12:34:21] <akosiaris>	 hmm maybe spam
[12:34:29] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881#4275315 (10Joe) 05Open>03Resolved a:03Joe
[12:36:05] <wikibugs>	 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4275321 (10jcrespo)
[12:36:21] <moritzm>	 nothing odd in mailman logs AFAICT (they're fairly noisy as plenty of (abandoned?) lists are repeatedly logged
[12:36:39] <wikibugs>	 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4275331 (10jcrespo)
[12:36:41] <akosiaris>	 https://grafana.wikimedia.org/dashboard/db/mail?refresh=5m&orgId=1&from=now-24h&to=now doesn't point to any spam spike
[12:37:10] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Puppet compiler is good:" [puppet] - 10https://gerrit.wikimedia.org/r/439891 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez)
[12:37:17] <akosiaris>	 what is that listinfo thing ?
[12:37:27] <jynus>	 I think the first thing is to know if http requests hanging is a cause or a consequence
[12:37:59] <wikibugs>	 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4275334 (10MoritzMuehlenhoff) Load was in the 120 ballpark and there were total of 141 "/usr/bin/python -S /var/lib/mailman/scripts/driver listinfo" processes running.
[12:38:32] <jynus>	 this last time seems more sustained
[12:38:38] <ema>	 !log cp3035: restart varnish-be, mbox lag
[12:38:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:12] <moritzm>	 seems to be CGI which does "Produce listinfo page, primary web entry-point to mailing lists"
[12:40:29] <akosiaris>	 so e.g. http://lists.wikimedia.org/mailman/listinfo/betacluster-alerts would call it I guess
[12:40:59] <akosiaris>	 I think I have the culprit
[12:41:26] <jynus>	 please share, or fix it first and then share :-)
[12:41:45] <icinga-wm>	 PROBLEM - mailman archives on fermium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:42:55] <icinga-wm>	 RECOVERY - mailman archives on fermium is OK: HTTP OK: HTTP/1.1 200 OK - 73975 bytes in 8.734 second response time
[12:43:02] <wikibugs>	 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4275349 (10jcrespo)
[12:43:25] <akosiaris>	 I 've banned a very specific IP
[12:43:25] <icinga-wm>	 PROBLEM - mailman list info on fermium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:44:10] <akosiaris>	 doesn't look like it helped though
[12:45:25] <akosiaris>	 I 've stopped apache and everything has subsided ...
[12:45:32] <akosiaris>	 so this is HTTP requests related
[12:45:46] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp3035 is OK: OK: expiry mailbox lag is 0
[12:45:46] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: base: keystone service requires false as boolean [puppet] - 10https://gerrit.wikimedia.org/r/439892 (https://phabricator.wikimedia.org/T196633)
[12:45:53] <jynus>	 but for how long?
[12:46:02] <jynus>	 if mailman overloads itself
[12:46:03] <akosiaris>	 probably not for long
[12:46:16] <akosiaris>	 so this is not mailman overloading itself
[12:46:20] <akosiaris>	 it's someone external overloading it
[12:46:24] <akosiaris>	 and I 've already banned an IP
[12:46:27] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki: add vhost define [puppet] - 10https://gerrit.wikimedia.org/r/439893 (https://phabricator.wikimedia.org/T196968)
[12:46:29] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki::web::beta_sites: convert wikibooks to vhost [puppet] - 10https://gerrit.wikimedia.org/r/439894 (https://phabricator.wikimedia.org/T196968)
[12:46:36] <icinga-wm>	 RECOVERY - mailman list info on fermium is OK: HTTP OK: HTTP/1.1 200 OK - 15500 bytes in 0.102 second response time
[12:46:59] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: base: keystone service requires false as boolean [puppet] - 10https://gerrit.wikimedia.org/r/439892 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez)
[12:47:13] <akosiaris>	 yeah the listinfo process have subsided very much
[12:47:20] <akosiaris>	 I see only like a few now
[12:47:26] <_joe_>	 elukey, Krenair https://gerrit.wikimedia.org/r/439893 and the followup, I'd like your opinion
[12:47:39] <_joe_>	 basically my idea is to convert all sites to use that define
[12:49:41] <moritzm>	 that IP you dropped made nearly 1300 requests today, maybe that fixed it, but the backlog is so large that we hadn't seen recovering effects yet
[12:50:04] <jynus>	 I will leave the topic
[12:50:16] <jynus>	 and the ticket for longer term analysis
[12:50:22] <wikibugs>	 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4275360 (10akosiaris) I 've banned a specific IP (I 'll share it in a private paste later on), restarted apache and everything seems to be ok now
[12:50:25] <jynus>	 is herron mostly working on email?
[12:50:56] <jynus>	 as in, is he the right person to take that or someone else?
[12:51:47] <jynus>	 I see the load going back up again
[12:51:49] <elukey>	 _joe_ seems a nice idea!
[12:52:31] <wikibugs>	 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4275364 (10akosiaris) P7249 for the list of IPs
[12:53:28] <jynus>	 yeah, it is going to fail again
[12:53:55] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp3039 is CRITICAL: CRITICAL: expiry mailbox lag is 2065091
[12:54:06] <elukey>	 _joe_ assuming of course that the vhosts will have the same structure in the future (I think this is the case since they haven't checked a lot)
[12:54:13] <elukey>	 but +1 from me, no concerns
[12:54:23] <elukey>	 I also like the clarity of the define in the puppet config
[12:54:42] <elukey>	 I was wondering if mod_macro could have been used instead but probably too messy
[12:57:47] <wikibugs>	 (03PS1) 10Paladox: Planet: Set xmlmaxarticles to 100 in rawdog config [puppet] - 10https://gerrit.wikimedia.org/r/439897
[12:58:19] <wikibugs>	 (03PS2) 10Paladox: Planet: Set xmlmaxarticles to 100 in rawdog config [puppet] - 10https://gerrit.wikimedia.org/r/439897
[12:58:37] <wikibugs>	 (03PS3) 10Paladox: Planet: Set xmlmaxarticles to 100 in rawdog config [puppet] - 10https://gerrit.wikimedia.org/r/439897 (https://phabricator.wikimedia.org/T196965)
[12:59:29] <wikibugs>	 (03PS4) 10Paladox: Planet: Set xmlmaxarticles to 100 in config [puppet] - 10https://gerrit.wikimedia.org/r/439897 (https://phabricator.wikimedia.org/T196965)
[13:00:05] <jouncebot>	 addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180612T1300).
[13:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[13:00:16] <wikibugs>	 10Operations, 10monitoring, 10User-fgiunchedi: Open Phab tasks on SMART failure - https://phabricator.wikimedia.org/T196994#4275410 (10fgiunchedi) p:05Triage>03Normal
[13:00:18] <wikibugs>	 (03CR) 10Paladox: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/439897 (https://phabricator.wikimedia.org/T196965) (owner: 10Paladox)
[13:00:28] <zeljkof>	 nice, no patches, no swat ;)
[13:00:50] <wikibugs>	 (03CR) 10Dzahn: [C: 032] Planet: Set xmlmaxarticles to 100 in config [puppet] - 10https://gerrit.wikimedia.org/r/439897 (https://phabricator.wikimedia.org/T196965) (owner: 10Paladox)
[13:04:04] <addshore>	 zeljkof: woo no swat patches
[13:04:06] <addshore>	 CFisch_remote: around?
[13:04:15] <addshore>	 jouncebot: next
[13:04:15] <jouncebot>	 In 0 hour(s) and 55 minute(s): FileImporter and FileExporter in group0 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180612T1400)
[13:04:29] <CFisch_remote>	 addshore: jepp
[13:04:49] <CFisch_remote>	 but a bit distracted
[13:04:54] <addshore>	 zeljkof: I'll start my next window now then as there is nothing in swat, and the first patch requires a fill sync, (YAY)
[13:05:08] <zeljkof>	 addshore: go ahead :D
[13:05:09] <addshore>	 CFisch_remote: is the patch on the branch ready? :)
[13:05:29] <CFisch_remote>	 nope I wanted to prepare that just before 4pm 
[13:05:34] <CFisch_remote>	 but we can do it now
[13:05:40] <addshore>	 ack :)
[13:07:32] <CFisch_remote>	 https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/FileExporter/+/439900/
[13:07:42] <wikibugs>	 (03CR) 10Hashar: [C: 031] Gerrit: Make PolyGerrit the default ui (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/439444 (https://phabricator.wikimedia.org/T196812) (owner: 10Paladox)
[13:08:15] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1230 is OK: OK
[13:08:43] <wikibugs>	 (03CR) 10Vgutierrez: [C: 031] "nitpick & inline doubt, but it's looking good :D" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/417948 (owner: 10Giuseppe Lavagetto)
[13:09:53] <wikibugs>	 (03CR) 10Paladox: Gerrit: Make PolyGerrit the default ui (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/439444 (https://phabricator.wikimedia.org/T196812) (owner: 10Paladox)
[13:11:24] <CFisch_remote>	 addshore: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/FileExporter/+/439900/
[13:12:29] <wikibugs>	 10Operations, 10ops-eqiad: ms-be1036 in power off status, not responsive to power on commands - https://phabricator.wikimedia.org/T196873#4275456 (10Cmjohnson) @fgiunchedi  I submitted a ticket with HP. I recommend removing the server from swift until it's fixed since I do not know what it's going to take to f...
[13:12:51] <wikibugs>	 (03PS3) 10Filippo Giunchedi: prometheus: use validate_cmd for rules and config files [puppet] - 10https://gerrit.wikimedia.org/r/432074
[13:13:35] <wikibugs>	 (03PS6) 10Elukey: Move the varnishkafka submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/437467 (https://phabricator.wikimedia.org/T188377)
[13:13:37] <wikibugs>	 (03PS2) 10Elukey: Move the kafkatee submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/437950 (https://phabricator.wikimedia.org/T188377)
[13:13:39] <wikibugs>	 (03PS2) 10Elukey: Move the jmxtrans submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/437951 (https://phabricator.wikimedia.org/T188377)
[13:13:41] <wikibugs>	 (03PS1) 10Elukey: Move the nginx submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/439901 (https://phabricator.wikimedia.org/T188377)
[13:14:03] <elukey>	 nope --^ didn't work
[13:14:03] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] prometheus: use validate_cmd for rules and config files [puppet] - 10https://gerrit.wikimedia.org/r/432074 (owner: 10Filippo Giunchedi)
[13:15:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Move the nginx submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/439901 (https://phabricator.wikimedia.org/T188377) (owner: 10Elukey)
[13:15:33] <wikibugs>	 10Operations, 10JADE, 10Scoring-platform-team, 10User-Joe: Scalability concerns creating a page per revision - https://phabricator.wikimedia.org/T196547#4275468 (10awight) In the per-page schema proposed above, the page-revision index would grow at the scary rate, up to one index entry per revision added t...
[13:16:09] <moritzm>	 !log installing openjdk-8 security updates on restbase-dev along with cassandra restarts
[13:16:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:24] <wikibugs>	 (03Abandoned) 10Elukey: Move the nginx submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/439901 (https://phabricator.wikimedia.org/T188377) (owner: 10Elukey)
[13:21:29] <addshore>	 CFisch_remote: cool!
[13:21:57] <wikibugs>	 (03PS3) 10Filippo Giunchedi: prometheus: alert on config reload failure [puppet] - 10https://gerrit.wikimedia.org/r/432059
[13:24:18] <godog>	 mhh fermium still with its cpu pegged, taking a look
[13:24:44] <wikibugs>	 (03PS2) 10Ema: varnish: Remove setting of CP cookies [puppet] - 10https://gerrit.wikimedia.org/r/437774 (https://phabricator.wikimedia.org/T110353) (owner: 10Krinkle)
[13:24:45] <addshore>	 CFisch_remote: apparently my internet is gone...
[13:24:46] <akosiaris>	 godog: https://phabricator.wikimedia.org/T196989#4275364
[13:25:05] <addshore>	 Just waiting for it to come back..
[13:25:52] <wikibugs>	 (03CR) 10Zhuyifei1999: "@Dvorapa Please don't bother rebasing patches in ops/puppet, unless it cannot be auto-rebased (conflict). It will be rebased by the person" [puppet] - 10https://gerrit.wikimedia.org/r/416874 (https://phabricator.wikimedia.org/T189052) (owner: 10Zhuyifei1999)
[13:26:23] <CFisch_remote>	 addshore: ^^'
[13:26:27] <wikibugs>	 (03CR) 10Ema: [C: 032] varnish: Remove setting of CP cookies [puppet] - 10https://gerrit.wikimedia.org/r/437774 (https://phabricator.wikimedia.org/T110353) (owner: 10Krinkle)
[13:26:43] <addshore>	 right, merging https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/FileExporter/+/439900/ on the .7 branch
[13:26:47] <addshore>	 *waits for CI*
[13:28:01] <wikibugs>	 (03CR) 10Dvorapa: "> @Dvorapa Please don't bother rebasing patches in ops/puppet, unless" [puppet] - 10https://gerrit.wikimedia.org/r/416874 (https://phabricator.wikimedia.org/T189052) (owner: 10Zhuyifei1999)
[13:28:44] <godog>	 akosiaris: thanks! yeah looks like more offenders, load at 100+
[13:29:14] <wikibugs>	 (03CR) 10Dvorapa: "Also sorry for some unrelated test accounts, I've overclicked" [puppet] - 10https://gerrit.wikimedia.org/r/416874 (https://phabricator.wikimedia.org/T189052) (owner: 10Zhuyifei1999)
[13:30:06] <icinga-wm>	 PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 503 (expecting: 200)
[13:30:42] <wikibugs>	 (03PS1) 10Gehel: maps: upgrade to cassandra-2.2.6-wmf5 [puppet] - 10https://gerrit.wikimedia.org/r/439905 (https://phabricator.wikimedia.org/T196044)
[13:31:15] <icinga-wm>	 RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy
[13:31:49] <wikibugs>	 (03CR) 10Gehel: [C: 032] maps: upgrade to cassandra-2.2.6-wmf5 [puppet] - 10https://gerrit.wikimedia.org/r/439905 (https://phabricator.wikimedia.org/T196044) (owner: 10Gehel)
[13:32:46] <Zoranzoki21>	 Hi, can you deploy https://gerrit.wikimedia.org/r/#/c/436211/
[13:32:47] <godog>	 in this case it'd be also nice if we could ask mod_cgi to always limits its concurrency heh
[13:32:47] <Zoranzoki21>	 Thanks!
[13:33:00] <addshore>	 CFisch_remote: looks merged to me
[13:33:54] <CFisch_remote>	 addshore: lets assume its merged then ;-)
[13:33:56] <wikibugs>	 (03PS5) 10Zoranzoki21: Add sites to the wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436211 (https://phabricator.wikimedia.org/T195270)
[13:36:04] <addshore>	 CFisch_remote: right, pulled onto tin, and now pulled onto mwdebug1002
[13:36:10] <addshore>	 *checks nothing is somehow broken*
[13:37:54] <CFisch_remote>	 I mean in theory nothing of this should be loaded atm 
[13:37:54] <wikibugs>	 (03PS2) 10Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871
[13:37:57] <wikibugs>	 10Operations, 10Cassandra, 10Discovery, 10Maps, 10Patch-For-Review: cassandra 2.2.6-wmf4 is not compatible with python 2.7.13 (debian stretch) - https://phabricator.wikimedia.org/T196044#4275526 (10Gehel) cassandra-2.2.6-wmf5 deployed on maps-test2004, it seems to work just fine.
[13:37:59] <CFisch_remote>	 but you never know
[13:38:22] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 (owner: 10Jcrespo)
[13:38:31] <logmsgbot>	 !log addshore@deploy1001 Started scap: [[gerrit:439900|FileExporter backport]] - Pre deployment backport (extension not yet deployed)
[13:38:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:37] <addshore>	 CFisch_remote: ^^
[13:39:09] <CFisch_remote>	 affirmative 
[13:39:11] <Zoranzoki21>	 addshore: Sorry, can you deploy https://gerrit.wikimedia.org/r/#/c/436211/?
[13:39:21] <wikibugs>	 (03PS4) 10Hoo man: Support prefixed dump types [puppet] - 10https://gerrit.wikimedia.org/r/424291 (https://phabricator.wikimedia.org/T163328) (owner: 10Lokal Profil)
[13:39:37] <wikibugs>	 10Operations, 10Deployments, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#4275527 (10Addshore) Just got this while syncing:  ``` 13:38:32 Sta...
[13:40:07] <addshore>	 sorry Zoranzoki21, as it wasn't in the calendar I have started something else, and the current sync will take ~45 mins
[13:40:29] <Zoranzoki21>	 addshore: Ok, I can add for next swat?
[13:40:33] <addshore>	 yup
[13:40:40] <wikibugs>	 (03CR) 10ArielGlenn: [C: 032] Support prefixed dump types [puppet] - 10https://gerrit.wikimedia.org/r/424291 (https://phabricator.wikimedia.org/T163328) (owner: 10Lokal Profil)
[13:41:00] <Zoranzoki21>	 addshore: tnx
[13:47:27] <wikibugs>	 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4275321 (10fgiunchedi) Looks like high load is back with a whole lot of `listinfo` requests
[13:49:27] <wikibugs>	 (03PS3) 10Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871
[13:49:54] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 (owner: 10Jcrespo)
[13:54:04] <wikibugs>	 (03PS1) 10Volans: Drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300)
[13:54:16] <wikibugs>	 (03PS1) 10Ema: vcl: avoid consistent hashing for pipe traffic [puppet] - 10https://gerrit.wikimedia.org/r/439911 (https://phabricator.wikimedia.org/T196553)
[13:55:14] <wikibugs>	 (03PS2) 10Volans: Client CLI: drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300)
[13:55:59] <wikibugs>	 (03CR) 10BBlack: [C: 031] vcl: avoid consistent hashing for pipe traffic [puppet] - 10https://gerrit.wikimedia.org/r/439911 (https://phabricator.wikimedia.org/T196553) (owner: 10Ema)
[13:56:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Client CLI: drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[13:58:49] <wikibugs>	 (03PS1) 10Herron: mailman: add per IP rate limit of 50 requests per 5 min [puppet] - 10https://gerrit.wikimedia.org/r/439912 (https://phabricator.wikimedia.org/T196989)
[14:00:04] <jouncebot>	 addshore and CFisch_WMDE: Dear deployers, time to do the FileImporter and FileExporter in group0 deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180612T1400).
[14:00:26] <addshore>	 O/
[14:00:36] <wikibugs>	 (03PS2) 10Ema: vcl: avoid consistent hashing for pipe traffic [puppet] - 10https://gerrit.wikimedia.org/r/439911 (https://phabricator.wikimedia.org/T196553)
[14:00:52] <addshore>	 CFisch_remote: internet just dropped again...
[14:01:00] <CFisch_remote>	 oh man
[14:01:02] <addshore>	 Or, DNS did. Mhmpf
[14:01:25] <CFisch_remote>	 at least you do not need to have a connection all the time for things to run ^^
[14:01:32] <wikibugs>	 (03PS1) 10Paladox: planet: Add labs common.yaml file to add hiera keys for labs only [puppet] - 10https://gerrit.wikimedia.org/r/439913
[14:03:08] <wikibugs>	 (03PS2) 10Paladox: planet: Add labs common.yaml file to add hiera keys for labs only [puppet] - 10https://gerrit.wikimedia.org/r/439913
[14:03:35] <wikibugs>	 (03PS3) 10Paladox: planet: Add labs common.yaml file to add hiera keys for labs only [puppet] - 10https://gerrit.wikimedia.org/r/439913
[14:04:06] <wikibugs>	 (03CR) 10Dzahn: [C: 032] planet: Add labs common.yaml file to add hiera keys for labs only [puppet] - 10https://gerrit.wikimedia.org/r/439913 (owner: 10Paladox)
[14:04:19] <wikibugs>	 (03PS4) 10Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871
[14:04:34] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 (owner: 10Jcrespo)
[14:06:17] <wikibugs>	 (03PS1) 10Paladox: planet: Add meta link to labs hiera value [puppet] - 10https://gerrit.wikimedia.org/r/439914
[14:06:25] <addshore>	 CFisch_remote: yup, woo for screen!
[14:06:34] <wikibugs>	 (03PS3) 10Ema: vcl: avoid consistent hashing for pipe traffic [puppet] - 10https://gerrit.wikimedia.org/r/439911 (https://phabricator.wikimedia.org/T196553)
[14:06:36] <wikibugs>	 (03PS2) 10Paladox: planet: Add meta link to labs hiera value [puppet] - 10https://gerrit.wikimedia.org/r/439914
[14:06:44] <wikibugs>	 (03CR) 10Paladox: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/439914 (owner: 10Paladox)
[14:07:58] <wikibugs>	 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Patch-For-Review: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4275640 (10akosiaris) Yeah, found something new, I 've reblocked some stuff, I 'll update P7249. Things do look normal again, this might just...
[14:08:12] <wikibugs>	 (03CR) 10Dzahn: [C: 032] planet: Add meta link to labs hiera value [puppet] - 10https://gerrit.wikimedia.org/r/439914 (owner: 10Paladox)
[14:09:09] <logmsgbot>	 !log addshore@deploy1001 Finished scap: [[gerrit:439900|FileExporter backport]] - Pre deployment backport (extension not yet deployed) (duration: 30m 37s)
[14:09:13] <addshore>	 CFisch_remote: ^^
[14:09:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:18] <addshore>	 only 30 mins now, woo
[14:09:33] <addshore>	 right, now onto the config I understand CFisch_remote ?
[14:10:07] <CFisch_remote>	 yes so next the config
[14:10:15] <CFisch_remote>	 the upper one first
[14:10:21] <CFisch_remote>	 and the the other
[14:10:52] <addshore>	 the upper one? :P
[14:10:53] <CFisch_remote>	 the last one is where it get's interesting and things can explode :-D
[14:10:59] <addshore>	 on the calendar ?:)
[14:11:04] <CFisch_remote>	 yep
[14:11:33] <CFisch_remote>	 or https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/439875/ to be precise
[14:13:00] <addshore>	 any reason you chose to put the test wikis that way around? (test2 being a source and test being a target)? just curious :D
[14:13:53] <CFisch_remote>	 addshore: when looking on the upload pages you will see that test2 has a super big warning to not upload things there
[14:14:03] <addshore>	 ooooh, cool
[14:14:15] <wikibugs>	 (03PS6) 10Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783
[14:14:15] <CFisch_remote>	 ( but still the form is shown and people do it )
[14:14:26] <CFisch_remote>	 so we thought it might be better to have it that way
[14:14:27] <wikibugs>	 (03PS2) 10Addshore: Allow setting of export target for FileExporter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439875 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch)
[14:14:33] <wikibugs>	 (03PS3) 10Addshore: Enable FileExporter and FileImporter on group0 with test setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439876 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch)
[14:15:01] <addshore>	 sounds like a good reason to me
[14:15:29] <wikibugs>	 (03PS3) 10Paladox: Gerrit: Hook up gerrit.wmfusercontent.org to alias [puppet] - 10https://gerrit.wikimedia.org/r/439808
[14:15:31] <wikibugs>	 (03CR) 10Addshore: [C: 032] Allow setting of export target for FileExporter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439875 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch)
[14:16:17] <wikibugs>	 (03PS4) 10Ema: vcl: avoid consistent hashing for pipe traffic [puppet] - 10https://gerrit.wikimedia.org/r/439911 (https://phabricator.wikimedia.org/T196553)
[14:16:41] <wikibugs>	 (03PS1) 10Herron: mailman: perform rbl checks on listinfo requests [puppet] - 10https://gerrit.wikimedia.org/r/439915 (https://phabricator.wikimedia.org/T196989)
[14:16:51] <wikibugs>	 (03CR) 10Ema: [C: 032] vcl: avoid consistent hashing for pipe traffic [puppet] - 10https://gerrit.wikimedia.org/r/439911 (https://phabricator.wikimedia.org/T196553) (owner: 10Ema)
[14:17:02] <wikibugs>	 (03Merged) 10jenkins-bot: Allow setting of export target for FileExporter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439875 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch)
[14:17:08] <wikibugs>	 10Operations, 10ops-eqiad: ms-be1036 in power off status, not responsive to power on commands - https://phabricator.wikimedia.org/T196873#4275648 (10fgiunchedi) @Cmjohnson ok! thanks, I'll being removing the machine from swift tomorrow
[14:17:44] <wikibugs>	 (03PS3) 10Volans: Client CLI: drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300)
[14:18:21] <wikibugs>	 (03CR) 10jenkins-bot: Allow setting of export target for FileExporter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439875 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch)
[14:18:28] <addshore>	 CFisch_remote: first patch is on mwdebug1002, *checks the world is still there*
[14:18:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Client CLI: drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[14:19:17] <CFisch_remote>	 :-)
[14:20:10] <addshore>	 syncing patch #1
[14:20:55] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: FileImporter/Exporter [[gerrit:439875|Allow setting of export target for FileExporter]] T195370 (duration: 00m 50s)
[14:20:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:59] <stashbot>	 T195370: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370
[14:21:07] <wikibugs>	 (03CR) 10Addshore: [C: 032] Enable FileExporter and FileImporter on group0 with test setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439876 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch)
[14:21:48] <addshore>	 CFisch_remote: and in goes patch #2
[14:22:06] <addshore>	 CFisch_remote: i guess you should be able to kind of fully test this while on mwdebug1002? :)
[14:22:27] <wikibugs>	 (03Merged) 10jenkins-bot: Enable FileExporter and FileImporter on group0 with test setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439876 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch)
[14:22:35] <wikibugs>	 (03PS2) 10Herron: mailman: perform rbl checks on listinfo requests [puppet] - 10https://gerrit.wikimedia.org/r/439915 (https://phabricator.wikimedia.org/T196989)
[14:22:54] <CFisch_remote>	 addshore: I hope so
[14:22:58] <wikibugs>	 (03CR) 10jenkins-bot: Enable FileExporter and FileImporter on group0 with test setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439876 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch)
[14:23:01] <CFisch_remote>	 I have the tabs open 
[14:23:08] <addshore>	 CFisch_remote: it is done
[14:23:13] <addshore>	 mwdebug1002 that is
[14:23:15] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] mailman: add per IP rate limit of 50 requests per 5 min [puppet] - 10https://gerrit.wikimedia.org/r/439912 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron)
[14:23:28] <addshore>	 you should totally update the author part too! im currently the only person listed there :O
[14:23:36] <icinga-wm>	 PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:23:41] * CFisch_remote checks
[14:24:24] <CFisch_remote>	 beta feature is there
[14:24:28] <CFisch_remote>	 link text is in
[14:24:45] <wikibugs>	 (03CR) 10Herron: "worth mentioning that this will add some dns lookup overhead to requests matching the REQUEST_URL. hits are cached." [puppet] - 10https://gerrit.wikimedia.org/r/439915 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron)
[14:24:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] mailman: add per IP rate limit of 50 requests per 5 min [puppet] - 10https://gerrit.wikimedia.org/r/439912 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron)
[14:25:04] <addshore>	 CFisch_remote: https://usercontent.irccloud-cdn.com/file/qU92P8Ol/image.png
[14:25:17] <CFisch_remote>	 o.O
[14:25:21] <CFisch_remote>	 I get "File uploads are not available on this wiki. If you have a legitimate need to test uploading, local bureaucrats can assign you the relevant right. "
[14:25:42] <addshore>	 oooh, you dont have the ability to upload files? :P
[14:25:45] <addshore>	 let me give you a flag
[14:26:03] <CFisch_remote>	 but that's a problem
[14:26:10] <CFisch_remote>	 so users cannot really test this
[14:26:10] <addshore>	 oh noes =o
[14:26:22] <CFisch_remote>	 why do I see the upload form but then do not have the rights to upload
[14:26:23] <CFisch_remote>	 ahhh
[14:26:30] <CFisch_remote>	 that's stupid
[14:26:40] <CFisch_remote>	 damn
[14:26:47] <addshore>	 so, autoconfirmed users should be able to upload
[14:26:48] <CFisch_remote>	 hmm ...
[14:27:04] <addshore>	 you just must not be autoconfirmed on testwiki
[14:27:06] <wikibugs>	 (03CR) 10Herron: mailman: add per IP rate limit of 50 requests per 5 min (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/439912 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron)
[14:27:14] <addshore>	 whats your username?
[14:27:14] <CFisch_remote>	 maybe
[14:27:24] <CFisch_remote>	 Christoph Jauera (WMDE)
[14:27:55] <addshore>	 https://usercontent.irccloud-cdn.com/file/Vr2WQnKY/image.png
[14:27:56] <icinga-wm>	 RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[14:28:17] <addshore>	 there must be another right needed for uploading that isnt in the group
[14:28:22] <CFisch_remote>	 damn so it's something different
[14:28:23] <CFisch_remote>	 yeah
[14:28:43] <CFisch_remote>	 and Lea is in a meeting and I can't reach her ... 
[14:28:52] <CFisch_remote>	 hard to say what we should do now
[14:28:58] <wikibugs>	 (03PS13) 10Alexandros Kosiaris: mathoid: Use the kubernetes LVS cluster explictly [puppet] - 10https://gerrit.wikimedia.org/r/437254
[14:29:00] <wikibugs>	 (03PS9) 10Alexandros Kosiaris: lvs: Add the proton lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/437997 (https://phabricator.wikimedia.org/T186748)
[14:29:21] <addshore>	 CFisch_remote: do uselang=qqx, what is the message key for that message?
[14:30:24] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] mathoid: Use the kubernetes LVS cluster explictly [puppet] - 10https://gerrit.wikimedia.org/r/437254 (owner: 10Alexandros Kosiaris)
[14:30:31] <CFisch_remote>	 addshore: hard to say it comes after the post when I try to upload for real
[14:30:38] <addshore>	 aaah lame
[14:30:47] <CFisch_remote>	 and thats strange because we do the upload check at the beginning
[14:30:47] <addshore>	 I think thats an on wiki override *looks for it*
[14:30:57] <CFisch_remote>	 it must be triggered somewhere "inside"
[14:31:13] <addshore>	 CFisch_remote: its abusefilter
[14:31:14] <addshore>	 :D
[14:31:18] <addshore>	 https://test.wikipedia.org/w/index.php?search=File+uploads+are+not+available+on+this+wiki&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%22namespaces%22%3A%5B%228%22%5D%7D&ns8=1&searchToken=2rnymx7bzcm0ly5x604x0pc5d
[14:31:21] <CFisch_remote>	 wtf :-D
[14:31:25] <CFisch_remote>	 nice
[14:31:37] <addshore>	 *looks at abusefilter*
[14:32:00] <wikibugs>	 (03PS2) 10Herron: mailman: add per IP rate limit of 50 requests per 5 min [puppet] - 10https://gerrit.wikimedia.org/r/439912 (https://phabricator.wikimedia.org/T196989)
[14:32:02] <CFisch_remote>	 we could lower that rule for the test phase then
[14:32:20] <addshore>	 CFisch_remote: https://test.wikipedia.org/w/index.php?title=Special:AbuseLog&wpSearchFilter=160
[14:32:27] <addshore>	 CFisch_remote: https://test.wikipedia.org/wiki/Special:AbuseFilter/160
[14:32:32] <addshore>	 requires autopatrol or reviewer
[14:33:17] <wikibugs>	 (03CR) 10Herron: [C: 032] mailman: add per IP rate limit of 50 requests per 5 min [puppet] - 10https://gerrit.wikimedia.org/r/439912 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron)
[14:33:21] <wikibugs>	 (03PS3) 10Herron: mailman: add per IP rate limit of 50 requests per 5 min [puppet] - 10https://gerrit.wikimedia.org/r/439912 (https://phabricator.wikimedia.org/T196989)
[14:33:55] <CFisch_remote>	 addshore: do you have rights to change that filter - so we temporarily disable it for the group0 test phase?
[14:34:01] <addshore>	 I do, hmm
[14:34:16] <CFisch_remote>	 ( it should be 2 weeks I think )
[14:34:19] <addshore>	 so, at what point do you hit that? once on the preview page and youve made changes etc?
[14:34:40] <wikibugs>	 (03CR) 10Dzahn: "this results in a line "ServerAlias" that isn't followed by an alias. Not sure if Apache will hate this." [puppet] - 10https://gerrit.wikimedia.org/r/439783 (owner: 10Paladox)
[14:34:44] <CFisch_remote>	 after the preview page
[14:34:49] <addshore>	 CFisch_remote: I'll disable it now so you can try again
[14:34:53] <CFisch_remote>	 when you press upload
[14:35:05] <wikibugs>	 (03CR) 10Ottomata: "Yeehaw, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/439911 (https://phabricator.wikimedia.org/T196553) (owner: 10Ema)
[14:35:43] <addshore>	 CFisch_remote: try again? :)
[14:35:46] <CFisch_remote>	 addshore: "This action has been automatically identified as harmful, and therefore disallowed. If you believe your action was constructive, please inform an administrator of what you were trying to do. A brief description of the abuse rule which your action matched is: Mass upload stop "
[14:35:48] <addshore>	 oh wait i failed
[14:35:51] <CFisch_remote>	 next filter ^^
[14:36:06] <CFisch_remote>	 man that sucks :-)
[14:36:10] <addshore>	 CFisch_remote: try now :)
[14:36:23] <CFisch_remote>	 \o/
[14:36:25] <CFisch_remote>	 worked
[14:36:28] <addshore>	 okay
[14:36:35] <addshore>	 right, im gonna do the rest of the sync then
[14:36:47] <CFisch_remote>	 phew nice, thank you so much addshore 
[14:37:16] <addshore>	 syncing
[14:37:47] <wikibugs>	 (03CR) 10Paladox: "> Patch Set 6:" [puppet] - 10https://gerrit.wikimedia.org/r/439783 (owner: 10Paladox)
[14:38:05] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: FileImporter/Exporter [[gerrit:439876|Enable FileExporter/Importer on group0 wikis]] T195370 (duration: 00m 51s)
[14:38:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:10] <stashbot>	 T195370: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370
[14:38:38] <addshore>	 CFisch_remote: in a meeting now
[14:38:43] <addshore>	 !log file exporter importer slot done
[14:38:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: "See nits inline, LGTM in general. If we are running into problems with legitimate clients we can introduce rate limits instead of outright" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/439915 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron)
[14:39:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] mailman: perform rbl checks on listinfo requests [puppet] - 10https://gerrit.wikimedia.org/r/439915 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron)
[14:39:58] <addshore>	 and CFisch_remote i made you an admin on test
[14:40:18] <addshore>	 CFisch_remote: so you can turn the filter back on after testing etc
[14:40:25] <CFisch_remote>	 nice, thanks again
[14:40:26] <CFisch_remote>	 yepp
[14:44:36] <wikibugs>	 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 2 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4003039 (10akosiaris) Hello,  I 've stalled adding LVS configuration for proton due to an instability we've been noticing. This instability i...
[14:44:38] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "per our IRC dicussion, should be a separate vhost, not a ServerAlias" [puppet] - 10https://gerrit.wikimedia.org/r/439783 (owner: 10Paladox)
[14:45:18] <wikibugs>	 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4275730 (10WMDE-Fisch)
[14:49:26] <icinga-wm>	 PROBLEM - HHVM rendering on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:50:25] <icinga-wm>	 RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 76182 bytes in 0.233 second response time
[14:51:52] <wikibugs>	 (03PS1) 10Herron: mailman: add recently observed false UA to bad_browser check [puppet] - 10https://gerrit.wikimedia.org/r/439922 (https://phabricator.wikimedia.org/T196989)
[14:53:24] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] mailman: add recently observed false UA to bad_browser check [puppet] - 10https://gerrit.wikimedia.org/r/439922 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron)
[14:53:39] <wikibugs>	 (03CR) 10Herron: [C: 032] mailman: add recently observed false UA to bad_browser check [puppet] - 10https://gerrit.wikimedia.org/r/439922 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron)
[14:56:29] <wikibugs>	 (03CR) 10Nuria: [C: 031] vcl: avoid consistent hashing for pipe traffic [puppet] - 10https://gerrit.wikimedia.org/r/439911 (https://phabricator.wikimedia.org/T196553) (owner: 10Ema)
[14:56:44] <wikibugs>	 (03PS5) 10Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871
[14:57:15] <wikibugs>	 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4224988 (10Tobi_WMDE_SW) >>! In T195370#4275730, @WMDE-Fisch wrote: > Can now be tested, e.g. on https://test2.wikipedia.o...
[14:57:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 (owner: 10Jcrespo)
[14:58:03] <wikibugs>	 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4275778 (10Addshore) >>! In T195370#4275772, @Tobi_WMDE_SW wrote: >>>! In T195370#4275730, @WMDE-Fisch wrote: >> Can now b...
[15:00:13] <wikibugs>	 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Patch-For-Review: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4275788 (10herron) General http(s) request rate limiting has been enabled for requests matching `\/mailman.*` with a threshold of 50 requests...
[15:00:33] <wikibugs>	 (03PS1) 10Ema: vcl: properly choose backend in vcl_pipe [puppet] - 10https://gerrit.wikimedia.org/r/439929 (https://phabricator.wikimedia.org/T196553)
[15:00:36] <wikibugs>	 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4275789 (10awight) 05Open>03Resolved a:03awight
[15:00:50] <wikibugs>	 (03PS4) 10Volans: Client CLI: drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300)
[15:01:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Client CLI: drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[15:02:15] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[15:03:23] <wikibugs>	 (03PS6) 10Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871
[15:03:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 (owner: 10Jcrespo)
[15:03:49] <elukey>	 seems upload esams having trouble
[15:03:53] <elukey>	 cc ema --^
[15:04:01] <ema>	 yup, cp3039
[15:04:02] <ema>	 thanks elukey 
[15:04:05] <elukey>	 <3
[15:04:06] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[15:07:25] <ema>	 !log cp3039: restart varnish-backend 
[15:07:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:43] <wikibugs>	 (03PS7) 10Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871
[15:14:24] <wikibugs>	 (03CR) 10Ema: [C: 032] vcl: properly choose backend in vcl_pipe [puppet] - 10https://gerrit.wikimedia.org/r/439929 (https://phabricator.wikimedia.org/T196553) (owner: 10Ema)
[15:14:58] <wikibugs>	 (03PS1) 10BBlack: esams rebalance: move cp3043 from text to upload [puppet] - 10https://gerrit.wikimedia.org/r/439936
[15:15:16] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp3039 is OK: OK: expiry mailbox lag is 0
[15:16:56] <wikibugs>	 10Operations, 10Traffic, 10User-Johan: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4275890 (10Johan) Translations are being collected at https://meta.wikimedia.org/wiki/User:Johan_(WMF)/AES128-SHA
[15:19:39] <wikibugs>	 (03CR) 10Dzahn: [C: 032] DNS: Add mgmt DNS entries for bast2002 (supposed to be in public VLAN) [dns] - 10https://gerrit.wikimedia.org/r/439786 (https://phabricator.wikimedia.org/T196665) (owner: 10Papaul)
[15:19:42] <wikibugs>	 (03PS1) 10Paladox: Add gerrit.wmfusercontent.org to common/cache/misc.yaml [puppet] - 10https://gerrit.wikimedia.org/r/439939
[15:20:27] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] Mask the default uwsgi service for ores [puppet] - 10https://gerrit.wikimedia.org/r/437984 (owner: 10Muehlenhoff)
[15:20:30] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Mask the default uwsgi service for ores [puppet] - 10https://gerrit.wikimedia.org/r/437984 (owner: 10Muehlenhoff)
[15:20:33] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Mask the default uwsgi service for ores [puppet] - 10https://gerrit.wikimedia.org/r/437984 (owner: 10Muehlenhoff)
[15:20:48] <wikibugs>	 (03PS2) 10Paladox: Add gerrit.wmfusercontent.org to common/cache/misc.yaml [puppet] - 10https://gerrit.wikimedia.org/r/439939
[15:21:12] <wikibugs>	 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183#4275931 (10Paladox)
[15:21:55] <icinga-wm>	 RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[15:22:15] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[15:22:23] <wikibugs>	 (03PS3) 10Paladox: Add gerrit.wmfusercontent.org to common/cache/misc.yaml [puppet] - 10https://gerrit.wikimedia.org/r/439939 (https://phabricator.wikimedia.org/T191183)
[15:22:59] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "this is adding a new director called "gerrit" (which already exists). what you want instead is adding a new domain to the existing directo" [puppet] - 10https://gerrit.wikimedia.org/r/439939 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox)
[15:23:42] <wikibugs>	 (03PS4) 10Paladox: Add gerrit.wmfusercontent.org to common/cache/misc.yaml [puppet] - 10https://gerrit.wikimedia.org/r/439939 (https://phabricator.wikimedia.org/T191183)
[15:24:19] <wikibugs>	 (03PS2) 10BBlack: esams rebalance: move cp3043 from text to upload [puppet] - 10https://gerrit.wikimedia.org/r/439936
[15:25:58] <wikibugs>	 (03PS1) 10Addshore: Enable FileImporter monolog channel in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439941 (https://phabricator.wikimedia.org/T195370)
[15:26:57] <wikibugs>	 (03PS3) 10Herron: mailman: perform rbl checks on listinfo requests [puppet] - 10https://gerrit.wikimedia.org/r/439915 (https://phabricator.wikimedia.org/T196989)
[15:27:48] <wikibugs>	 (03PS4) 10Herron: mailman: perform rbl checks on listinfo requests [puppet] - 10https://gerrit.wikimedia.org/r/439915 (https://phabricator.wikimedia.org/T196989)
[15:28:55] <wikibugs>	 (03CR) 10Herron: [C: 032] mailman: perform rbl checks on listinfo requests [puppet] - 10https://gerrit.wikimedia.org/r/439915 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron)
[15:29:13] <wikibugs>	 (03PS5) 10Dzahn: cache::misc: Add gerrit backend, gerrit.wmfusercontent.org  [puppet] - 10https://gerrit.wikimedia.org/r/439939 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox)
[15:29:36] <bblack>	 !log cp3043 switching from text to upload shortly, downtimed in icinga for 2h - https://gerrit.wikimedia.org/r/c/operations/puppet/+/439936
[15:29:39] <wikibugs>	 (03PS6) 10Dzahn: cache::misc: Add gerrit backend, gerrit.wmfusercontent.org  [puppet] - 10https://gerrit.wikimedia.org/r/439939 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox)
[15:29:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:48] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: jobrunner: reduce the number of old runners [puppet] - 10https://gerrit.wikimedia.org/r/439943 (https://phabricator.wikimedia.org/T197003)
[15:29:50] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: jobrunner: reduce to one redis server per datacenter [puppet] - 10https://gerrit.wikimedia.org/r/439944 (https://phabricator.wikimedia.org/T197003)
[15:29:55] <wikibugs>	 (03CR) 10Paladox: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/439939 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox)
[15:30:40] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Reduce the jobqueue redis to use just one server per dc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439945 (https://phabricator.wikimedia.org/T197003)
[15:31:14] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "not affecting anything prod so far. gerrit itself isnt behind misc::web, this is for hosting avatars in the future" [puppet] - 10https://gerrit.wikimedia.org/r/439939 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox)
[15:31:53] <paladox>	 mutante thanks :)
[15:33:48] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11463/ 2 runners per job type are more than enough given the current traffic." [puppet] - 10https://gerrit.wikimedia.org/r/439943 (https://phabricator.wikimedia.org/T197003) (owner: 10Giuseppe Lavagetto)
[15:33:55] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: jobrunner: reduce the number of old runners [puppet] - 10https://gerrit.wikimedia.org/r/439943 (https://phabricator.wikimedia.org/T197003)
[15:34:41] <wikibugs>	 (03CR) 10Ema: [C: 031] esams rebalance: move cp3043 from text to upload [puppet] - 10https://gerrit.wikimedia.org/r/439936 (owner: 10BBlack)
[15:37:05] <icinga-wm>	 PROBLEM - mailman list info on fermium is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string Wikimedia Mailing List not found on https://lists.wikimedia.org:443/mailman/listinfo/wikimedia-l - 298 bytes in 0.008 second response time
[15:38:39] <wikibugs>	 (03CR) 10Ema: [C: 031] Set eventstreams max_connections to 25 per varnish instance [puppet] - 10https://gerrit.wikimedia.org/r/439772 (https://phabricator.wikimedia.org/T196553) (owner: 10Ottomata)
[15:38:57] <wikibugs>	 (03PS5) 10Volans: Client CLI: drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300)
[15:39:39] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "confirmed with racadm getsysinfo" [puppet] - 10https://gerrit.wikimedia.org/r/439792 (https://phabricator.wikimedia.org/T196665) (owner: 10Papaul)
[15:39:48] <wikibugs>	 (03PS2) 10Dzahn: DHCP: Add MAC address for bast2002 [puppet] - 10https://gerrit.wikimedia.org/r/439792 (https://phabricator.wikimedia.org/T196665) (owner: 10Papaul)
[15:39:50] <wikibugs>	 (03PS1) 10Herron: Revert "mailman: perform rbl checks on listinfo requests" [puppet] - 10https://gerrit.wikimedia.org/r/439948
[15:40:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Client CLI: drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[15:40:18] <bblack>	 !log cp3034 - nevermind, doing different approach later in the day, still pooled in text for now!
[15:40:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:53] <wikibugs>	 (03CR) 10Herron: [C: 032] "This is seeming too aggressive in testing after deployment.  reverting." [puppet] - 10https://gerrit.wikimedia.org/r/439915 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron)
[15:41:18] <wikibugs>	 (03CR) 10Volans: "The only failure are the py27 tests as expected due to T196628" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[15:41:48] <wikibugs>	 (03CR) 10Herron: [C: 032] Revert "mailman: perform rbl checks on listinfo requests" [puppet] - 10https://gerrit.wikimedia.org/r/439948 (owner: 10Herron)
[15:41:54] <wikibugs>	 (03PS2) 10Herron: Revert "mailman: perform rbl checks on listinfo requests" [puppet] - 10https://gerrit.wikimedia.org/r/439948
[15:42:32] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] Install LFS on scap targets (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627) (owner: 10Awight)
[15:46:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Found in testing, e.g. my home ip address was being 403'd" [puppet] - 10https://gerrit.wikimedia.org/r/439948 (owner: 10Herron)
[15:48:56] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4276003 (10chasemp) a:05Cmjohnson>03Bstorm
[15:51:27] <wikibugs>	 (03CR) 10Dzahn: [C: 032] DNS: Add mgmt & production DNS entries for lvs200[7-10] [dns] - 10https://gerrit.wikimedia.org/r/439803 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul)
[15:51:54] <wikibugs>	 10Operations, 10JADE, 10Scoring-platform-team, 10User-Joe: Scalability concerns creating a page per revision - https://phabricator.wikimedia.org/T196547#4276013 (10Halfak) I don't think we should be designing for the worst-case scenario here.  There are many situations where content creation patterns are c...
[15:51:56] <wikibugs>	 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Patch-For-Review: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4276012 (10herron) I'm still able to generate noticeable load by hitting listinfo repeatedly within the 50req/5 min rate limit, so we might be...
[15:52:18] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "looks right. needs manual rebase. doing that" [dns] - 10https://gerrit.wikimedia.org/r/439803 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul)
[15:55:28] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4276025 (10chasemp) I think this is ready for OS install and such?  I spoke with @bstorm who is going to take this on and may need...
[15:55:53] <wikibugs>	 (03PS2) 10Dzahn: DNS: Add mgmt & production DNS entries for lvs200[7-10] [dns] - 10https://gerrit.wikimedia.org/r/439803 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul)
[15:57:18] <wikibugs>	 (03CR) 10Dzahn: [C: 032] DNS: Add mgmt & production DNS entries for lvs200[7-10] [dns] - 10https://gerrit.wikimedia.org/r/439803 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul)
[15:58:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] "Looks good und unblocks Python 3 packages :-)" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[15:59:26] <wikibugs>	 (03PS7) 10Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783
[15:59:40] <wikibugs>	 (03PS8) 10Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783
[15:59:43] <wikibugs>	 (03CR) 10Volans: [V: 032 C: 032] Client CLI: drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[16:00:00] <wikibugs>	 (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/11450/" [puppet] - 10https://gerrit.wikimedia.org/r/439772 (https://phabricator.wikimedia.org/T196553) (owner: 10Ottomata)
[16:00:04] <jouncebot>	 godog, moritzm, and _joe_: #bothumor My software never has bugs. It just develops random features. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180612T1600).
[16:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[16:00:04] <wikibugs>	 (03PS2) 10Ottomata: Set eventstreams max_connections to 25 per varnish instance [puppet] - 10https://gerrit.wikimedia.org/r/439772 (https://phabricator.wikimedia.org/T196553)
[16:00:09] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] Set eventstreams max_connections to 25 per varnish instance [puppet] - 10https://gerrit.wikimedia.org/r/439772 (https://phabricator.wikimedia.org/T196553) (owner: 10Ottomata)
[16:00:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 (owner: 10Paladox)
[16:02:15] <wikibugs>	 (03PS9) 10Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783
[16:02:31] <mutante>	 jynus: is this a duplicate thing? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/437382/
[16:02:47] <wikibugs>	 (03PS10) 10Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783
[16:03:31] <jynus>	 no, that sould be kept
[16:03:34] <jynus>	 I already fixed that
[16:03:41] <jynus>	 by moving the contents
[16:03:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 (owner: 10Paladox)
[16:03:50] <wikibugs>	 (03PS12) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298)
[16:05:30] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff)
[16:05:33] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "I already fixed by moving the unrelated bits elsewhere, but mariadb maintenance for mediawiki should be kept there. It currently is empty," [puppet] - 10https://gerrit.wikimedia.org/r/437382 (owner: 10Dzahn)
[16:05:35] <icinga-wm>	 PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet last ran 9 hours ago
[16:06:05] <wikibugs>	 10Operations, 10Discovery, 10Icinga, 10Maps, and 2 others: Create Icinga alert when OSM replication lags on maps - https://phabricator.wikimedia.org/T167549#4276089 (10Gehel) 05Open>03Resolved
[16:06:58] <wikibugs>	 (03PS11) 10Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783
[16:07:01] <wikibugs>	 10Operations, 10Maps-Sprint, 10Patch-For-Review: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4276128 (10Gehel) 05Open>03Resolved
[16:07:03] <wikibugs>	 (03PS1) 10Cmjohnson: Snapshot1009, adding dhcpd and netboot [puppet] - 10https://gerrit.wikimedia.org/r/439958 (https://phabricator.wikimedia.org/T196189)
[16:07:33] <wikibugs>	 (03PS1) 10Volans: Updated src to v0.1.3 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/439959 (https://phabricator.wikimedia.org/T191300)
[16:07:35] <wikibugs>	 (03PS1) 10Volans: Built wheels for v0.1.2 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/439960 (https://phabricator.wikimedia.org/T191300)
[16:09:06] <wikibugs>	 (03CR) 10Volans: [V: 032 C: 032] Updated src to v0.1.3 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/439959 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[16:09:23] <wikibugs>	 (03CR) 10ArielGlenn: [C: 031] Snapshot1009, adding dhcpd and netboot [puppet] - 10https://gerrit.wikimedia.org/r/439958 (https://phabricator.wikimedia.org/T196189) (owner: 10Cmjohnson)
[16:09:47] <wikibugs>	 (03PS2) 10Cmjohnson: Snapshot1009, adding dhcpd and netboot [puppet] - 10https://gerrit.wikimedia.org/r/439958 (https://phabricator.wikimedia.org/T196189)
[16:09:54] <wikibugs>	 (03PS2) 10Volans: Built wheels for v0.1.3 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/439960 (https://phabricator.wikimedia.org/T191300)
[16:10:11] <wikibugs>	 (03CR) 10Volans: [V: 032 C: 032] Built wheels for v0.1.3 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/439960 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans)
[16:10:25] <icinga-wm>	 RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[16:10:55] <wikibugs>	 (03PS12) 10Paladox: Gerrit: Add support for avatars url in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783
[16:11:02] <wikibugs>	 (03CR) 10Cmjohnson: [C: 032] Snapshot1009, adding dhcpd and netboot [puppet] - 10https://gerrit.wikimedia.org/r/439958 (https://phabricator.wikimedia.org/T196189) (owner: 10Cmjohnson)
[16:11:11] <logmsgbot>	 !log volans@deploy1001 Started deploy [debmonitor/deploy@0eca14a]: Release v0.1.3
[16:11:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:33] <logmsgbot>	 !log volans@deploy1001 Finished deploy [debmonitor/deploy@0eca14a]: Release v0.1.3 (duration: 00m 22s)
[16:11:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:47] <wikibugs>	 (03PS1) 10Jcrespo: mariadb mediawiki maintenance: Update comment [puppet] - 10https://gerrit.wikimedia.org/r/439961
[16:13:47] <wikibugs>	 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4276219 (10Imarlier)
[16:16:15] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active
[16:23:22] <wikibugs>	 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Patch-For-Review: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4276270 (10fgiunchedi) A bigger nail in the coffin for GET requests is also going to be enabling caching by apache, at least for `listinfo` th...
[16:25:55] <icinga-wm>	 PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 503 (expecting: 404)
[16:27:31] <wikibugs>	 (03PS13) 10Paladox: Gerrit: Add support for avatars url in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783
[16:29:15] <icinga-wm>	 RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy
[16:30:09] <wikibugs>	 (03CR) 10Paladox: "Puppet compiler results https://puppet-compiler.wmflabs.org/compiler02/11467/" [puppet] - 10https://gerrit.wikimedia.org/r/439783 (owner: 10Paladox)
[16:30:13] <wikibugs>	 (03PS13) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298)
[16:31:16] <wikibugs>	 (03PS1) 10Papaul: DNS: Add production DNS entries for bast2002 [dns] - 10https://gerrit.wikimedia.org/r/439965 (https://phabricator.wikimedia.org/T196665)
[16:31:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff)
[16:39:25] <icinga-wm>	 PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received
[16:40:25] <icinga-wm>	 RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy
[16:42:16] <wikibugs>	 (03PS3) 10BBlack: esams rebalance: add 3043 to upload [puppet] - 10https://gerrit.wikimedia.org/r/439936
[16:42:18] <wikibugs>	 (03PS1) 10BBlack: esams rebalance: remove 3043 from text [puppet] - 10https://gerrit.wikimedia.org/r/439967
[16:42:30] <wikibugs>	 (03PS14) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298)
[16:43:23] <wikibugs>	 (03CR) 10Paladox: "New date is friday as no one will be around on monday." [puppet] - 10https://gerrit.wikimedia.org/r/439444 (https://phabricator.wikimedia.org/T196812) (owner: 10Paladox)
[16:44:00] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff)
[16:44:18] <wikibugs>	 10Operations, 10ops-eqiad, 10Datasets-General-or-Unknown, 10Patch-For-Review, 10User-ArielGlenn: rack/setup/install snapshot1009 - https://phabricator.wikimedia.org/T196189#4276353 (10Cmjohnson)
[16:44:26] <wikibugs>	 10Operations, 10ops-eqiad, 10Datasets-General-or-Unknown, 10Patch-For-Review, 10User-ArielGlenn: rack/setup/install snapshot1009 - https://phabricator.wikimedia.org/T196189#4249646 (10Cmjohnson) 05Open>03Resolved
[16:45:57] <wikibugs>	 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#4276365 (10Bstorm)
[16:46:00] <wikibugs>	 10Operations, 10Cloud-VPS, 10cloud-services-team: templatetiger is using 827G of 8T available tools nfs storage - https://phabricator.wikimedia.org/T183954#4276364 (10Bstorm) 05Open>03Resolved
[16:49:35] <wikibugs>	 (03PS1) 10ArielGlenn: add snapshot1009 as dumps testbed [puppet] - 10https://gerrit.wikimedia.org/r/439970
[16:49:56] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4276376 (10Cmjohnson) Still need add mac address to the dhcp file and the netboot.cfg. I just enabled the switch ports so once the...
[16:51:48] <wikibugs>	 (03PS1) 10Papaul: DNS: Add mgmt DNS entries for dns200[1-2] [dns] - 10https://gerrit.wikimedia.org/r/439973 (https://phabricator.wikimedia.org/T196493)
[16:52:05] <wikibugs>	 (03PS1) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439974 (https://phabricator.wikimedia.org/T191298)
[16:53:08] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439974 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff)
[16:53:18] <wikibugs>	 (03CR) 10ArielGlenn: [C: 032] add snapshot1009 as dumps testbed [puppet] - 10https://gerrit.wikimedia.org/r/439970 (owner: 10ArielGlenn)
[16:54:13] <marxarelli>	 !log starting branch cut for 1.32.0-wmf.8
[16:54:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:46] <wikibugs>	 (03PS2) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439974 (https://phabricator.wikimedia.org/T191298)
[16:56:12] <wikibugs>	 (03PS1) 10ArielGlenn: add snapshot1009 to dumps scap targets [dumps/scap] - 10https://gerrit.wikimedia.org/r/439975
[16:56:36] <apergos>	 if snapshot1009 whines it's being installed, please ignore
[16:57:00] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439974 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff)
[16:57:42] <wikibugs>	 (03CR) 10ArielGlenn: [V: 032 C: 032] add snapshot1009 to dumps scap targets [dumps/scap] - 10https://gerrit.wikimedia.org/r/439975 (owner: 10ArielGlenn)
[16:58:44] <wikibugs>	 10Operations, 10ops-codfw, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install dns200[12].wikimedia.org - https://phabricator.wikimedia.org/T196493#4276412 (10Papaul)
[17:00:04] <jouncebot>	 cscott, arlolra, subbu, halfak, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180612T1700).
[17:01:41] <wikibugs>	 10Operations, 10Cassandra, 10Discovery, 10Maps, 10Patch-For-Review: cassandra 2.2.6-wmf4 is not compatible with python 2.7.13 (debian stretch) - https://phabricator.wikimedia.org/T196044#4276434 (10Gehel) @Eevans what do we need to do before uploading this to reprepro? I assume some coordination with @el...
[17:02:47] <wikibugs>	 (03Abandoned) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439974 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff)
[17:05:40] <wikibugs>	 (03PS8) 10Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871
[17:07:21] <wikibugs>	 10Operations, 10Cassandra, 10Discovery, 10Maps, 10Patch-For-Review: cassandra 2.2.6-wmf4 is not compatible with python 2.7.13 (debian stretch) - https://phabricator.wikimedia.org/T196044#4276459 (10Eevans) >>! In T196044#4276434, @Gehel wrote: > @Eevans what do we need to do before uploading this to repr...
[17:07:24] <icinga-wm>	 PROBLEM - nutcracker process on snapshot1009 is CRITICAL: NRPE: Command check_nutcracker not defined
[17:07:44] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:07:53] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on snapshot1009 is CRITICAL: NRPE: Command check_ferm_active not defined
[17:07:53] <icinga-wm>	 PROBLEM - nutcracker port on snapshot1009 is CRITICAL: NRPE: Command check_nutcracker_port not defined
[17:09:53] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on snapshot1009 is OK: OK ferm input default policy is set
[17:11:46] <wikibugs>	 (03CR) 10Paladox: "Delayed until after the sre offsite." [puppet] - 10https://gerrit.wikimedia.org/r/439444 (https://phabricator.wikimedia.org/T196812) (owner: 10Paladox)
[17:14:04] <icinga-wm>	 PROBLEM - puppet last run on snapshot1009 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 1 minute ago with 3 failures. Failed resources (up to 3 shown): Package[lilypond],Package[php-luasandbox],Package[dumps/dumps]
[17:16:55] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1034 - https://phabricator.wikimedia.org/T195569#4276487 (10Cmjohnson) 05Open>03Resolved Thanks!
[17:19:50] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Power supply issue on maps1002 - https://phabricator.wikimedia.org/T196897#4276498 (10Cmjohnson) Your case was successfully submitted. Please note your Case ID: 5330129651 for future reference.
[17:25:49] <apergos>	 yeah we know about the puppet thing, ignore please
[17:27:26] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: Bad disk on db1065 - https://phabricator.wikimedia.org/T196806#4276512 (10Marostegui) 05Open>03Resolved The new disk worked fine, thanks!! ``` root@db1065:~# megacli -LDPDInfo -aAll  Adapter #0  Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name...
[17:37:05] <logmsgbot>	 !log ariel@deploy1001 Started deploy [dumps/dumps@038c8b3]: sync after snapshot1009 install
[17:37:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:12] <logmsgbot>	 !log ariel@deploy1001 Finished deploy [dumps/dumps@038c8b3]: sync after snapshot1009 install (duration: 00m 07s)
[17:37:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:56] <logmsgbot>	 !log ariel@deploy1001 Started deploy [dumps/dumps@038c8b3]: sync after snapshot1009 install
[17:37:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:01] <logmsgbot>	 !log ariel@deploy1001 Finished deploy [dumps/dumps@038c8b3]: sync after snapshot1009 install (duration: 00m 04s)
[17:38:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:35] <apergos>	 almost there... one reboot to go
[17:40:33] <icinga-wm>	 PROBLEM - Host snapshot1009 is DOWN: PING CRITICAL - Packet loss = 100%
[17:41:17] <apergos>	 it's rebooting....
[17:41:33] <icinga-wm>	 RECOVERY - nutcracker port on snapshot1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[17:41:43] <icinga-wm>	 RECOVERY - Host snapshot1009 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[17:42:11] <wikibugs>	 (03PS1) 10Dduvall: Group0 to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439987
[17:42:14] <icinga-wm>	 RECOVERY - nutcracker process on snapshot1009 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[17:42:33] <icinga-wm>	 RECOVERY - Check systemd state on snapshot1009 is OK: OK - running: The system is fully operational
[17:44:34] <icinga-wm>	 RECOVERY - puppet last run on snapshot1009 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[17:48:33] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[17:51:43] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:53:18] <wikibugs>	 (03PS1) 10ArielGlenn: get snapshot1001 ready for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/439991
[17:57:28] <AaronSchulz>	 twentyafterfour: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/439778/
[17:58:10] <wikibugs>	 (03CR) 10ArielGlenn: [C: 032] get snapshot1001 ready for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/439991 (owner: 10ArielGlenn)
[18:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180612T1800)
[18:03:03] <logmsgbot>	 !log dduvall@deploy1001 Started scap: testwiki to php-1.32.0-wmf.8 and rebuild l10n cache
[18:03:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:45] <wikibugs>	 10Operations, 10Citoid, 10Code-Stewardship-Reviews, 10VisualEditor, 10Services (watching): zotero translation server: code stewardship request - https://phabricator.wikimedia.org/T187194#4276578 (10Jrbranaa) Added entry to developers/maintainers page.  Please augment with more accurate description and li...
[18:06:30] <wikibugs>	 10Operations, 10Citoid, 10Code-Stewardship-Reviews, 10VisualEditor, 10Services (watching): zotero translation server: code stewardship request - https://phabricator.wikimedia.org/T187194#4276587 (10Jrbranaa) >>! In T187194#4256587, @faidon wrote: > So we need to do //something// in a very short amount of...
[18:08:13] <icinga-wm>	 PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 503 (expecting: 404)
[18:09:13] <icinga-wm>	 RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy
[18:12:31] <marxarelli>	 AaronSchulz: does that need to go out with the train?
[18:12:55] <marxarelli>	 that = https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/439778/
[18:13:49] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission, 10User-ArielGlenn: decommission snapshot1001 - https://phabricator.wikimedia.org/T197021#4276597 (10ArielGlenn) p:05Triage>03Normal
[18:16:24] <icinga-wm>	 RECOVERY - Check systemd state on db1068 is OK: OK - running: The system is fully operational
[18:19:13] <AaronSchulz>	 marxarelli: would be nice (for T194403). It's not new to wmf8 though.
[18:19:13] <stashbot>	 T194403: Wikimedia\Rdbms\ChronologyProtector::initPositions: expected but failed to find position index. - https://phabricator.wikimedia.org/T194403
[18:21:35] <marxarelli>	 AaronSchulz: kk. if you can get it reviewed/merged, i'll cherry-pick it to 1.32.0-wmf.8 and make sure it gets deployed
[18:21:53] <marxarelli>	 i'm chilling until the deploy window, so you have some time
[18:29:43] <jynus>	 AaronSchulz sorry if it looked like I was pressing you to do something, I wasn't
[18:30:16] <jynus>	 lately I am trying to be clear about ongoing errors to avoid missunderstandings
[18:30:44] <jynus>	 if the answer is "not a huge deal, will do at other time", it is ok too
[18:35:49] <AaronSchulz>	 I was trying to backport anyway :)
[18:36:47] <jynus>	 I would like to talk to you about roadmap of architecture- I think some things we do now will not work on multi-dc
[18:36:59] <jynus>	 (not now, but soon-ish)
[18:40:06] <wikibugs>	 (03PS1) 10Herron: mailman: whitelist icinga hosts from rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/439995 (https://phabricator.wikimedia.org/T196989)
[18:40:50] <wikibugs>	 (03PS1) 10Ottomata: Use Kafka main-eqiad for EventStreams service [puppet] - 10https://gerrit.wikimedia.org/r/439996 (https://phabricator.wikimedia.org/T185225)
[18:41:25] <wikibugs>	 (03CR) 10Herron: [C: 032] mailman: whitelist icinga hosts from rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/439995 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron)
[18:42:43] <logmsgbot>	 !log dduvall@deploy1001 Finished scap: testwiki to php-1.32.0-wmf.8 and rebuild l10n cache (duration: 39m 39s)
[18:42:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:12] <icinga-wm>	 PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:56:18] <icinga-wm>	 RECOVERY - mailman list info on fermium is OK: HTTP OK: HTTP/1.1 200 OK - 15500 bytes in 0.152 second response time
[18:57:02] <herron>	 !log restarted icinga service on einsteinium
[18:57:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:04] <jouncebot>	 marxarelli: That opportune time is upon us again. Time for a MediaWiki train deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180612T1900).
[19:00:59] <greg-g>	 weeee
[19:03:01] <marxarelli>	 AaronSchulz: any update on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/439778/ ? train is leaving the station soon
[19:03:17] <icinga-wm>	 RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[19:13:31] <wikibugs>	 (03CR) 10Dduvall: [C: 032] Group0 to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439987 (owner: 10Dduvall)
[19:15:03] <wikibugs>	 (03Merged) 10jenkins-bot: Group0 to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439987 (owner: 10Dduvall)
[19:16:53] <logmsgbot>	 !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.32.0-wmf.8
[19:16:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:59] <wikibugs>	 (03CR) 10jenkins-bot: Group0 to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439987 (owner: 10Dduvall)
[19:19:13] <wikibugs>	 (03PS1) 10Urbanecm: Allow bcts on private&fishbowl wikis advanced privilege manipulation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440000 (https://phabricator.wikimedia.org/T197024)
[19:19:15] <wikibugs>	 (03PS1) 10Urbanecm: Clean legacy AddGroups/RemoveGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440001 (https://phabricator.wikimedia.org/T197024)
[19:20:14] <wikibugs>	 (03CR) 10Zoranzoki21: [C: 031] "440k :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440000 (https://phabricator.wikimedia.org/T197024) (owner: 10Urbanecm)
[19:20:56] <wikibugs>	 (03PS6) 10Zoranzoki21: Add sites to the wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436211 (https://phabricator.wikimedia.org/T195270)
[19:25:27] <wikibugs>	 (03PS1) 10Urbanecm: Some wikis bureacurats are able to grant non-grantable groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440002 (https://phabricator.wikimedia.org/T197026)
[19:29:57] <icinga-wm>	 PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test 
[19:29:57] <icinga-wm>	 from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 503 (expecting: 200)
[19:30:57] <icinga-wm>	 RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy
[19:47:21] <bawolff>	 wow, the loading global options creates quite a lot of log messages in logstash
[19:49:11] <wikibugs>	 (03PS1) 10Ottomata: Add kafka_mirror_maker cert [labs/private] - 10https://gerrit.wikimedia.org/r/440008
[19:49:31] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] Add kafka_mirror_maker cert [labs/private] - 10https://gerrit.wikimedia.org/r/440008 (owner: 10Ottomata)
[19:54:09] <wikibugs>	 (03PS1) 10Urbanecm: Make ProofreadPage operate on correct namespaces in pmswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440009 (https://phabricator.wikimedia.org/T197033)
[19:57:41] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4276984 (10mepps) Thank you @Dzahn! I'm currently trying to log into JupyterHub and my wikitech credentials aren't working. I just wanted to make sure I was ad...
[20:17:44] <Zoranzoki21>	 What should be done next, so ORES can be enabled on srwiki?
[20:28:04] <wikibugs>	 (03PS2) 10Herron: adds jforrester to deployment, deploy-service, & mobileapps-admin groups [puppet] - 10https://gerrit.wikimedia.org/r/437819 (https://phabricator.wikimedia.org/T196566) (owner: 10RobH)
[20:29:02] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): Requesting deployment access for jforrester - https://phabricator.wikimedia.org/T196566#4277046 (10herron) Thanks! Moving forward with the patch now.
[20:29:19] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): Requesting deployment access for jforrester - https://phabricator.wikimedia.org/T196566#4277050 (10herron)
[20:29:48] <wikibugs>	 (03CR) 10Herron: [C: 032] adds jforrester to deployment, deploy-service, & mobileapps-admin groups [puppet] - 10https://gerrit.wikimedia.org/r/437819 (https://phabricator.wikimedia.org/T196566) (owner: 10RobH)
[20:43:48] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp3046 is CRITICAL: CRITICAL: expiry mailbox lag is 2032720
[20:51:32] <wikibugs>	 (03PS1) 10Ottomata: Regenerate all certificates that were signed by the now decommed puppetmaster02 [labs/private] - 10https://gerrit.wikimedia.org/r/440016 (https://phabricator.wikimedia.org/T195686)
[20:51:58] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] Regenerate all certificates that were signed by the now decommed puppetmaster02 [labs/private] - 10https://gerrit.wikimedia.org/r/440016 (https://phabricator.wikimedia.org/T195686) (owner: 10Ottomata)
[20:52:06] <wikibugs>	 (03CR) 10MarcoAurelio: "I proposed this in the past and the question was 'did they asked for it?'. Well, on one hand I do not oppose this change. On the other, I " (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440000 (https://phabricator.wikimedia.org/T197024) (owner: 10Urbanecm)
[20:54:17] <icinga-wm>	 PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received
[20:55:17] <icinga-wm>	 RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy
[20:55:45] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): Requesting deployment access for jforrester - https://phabricator.wikimedia.org/T196566#4277168 (10herron) 05Open>03Resolved a:03herron Access has been provisioned @Jdforrester-WMF  ``` deploy1001...
[20:55:50] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): Requesting deployment access for jforrester - https://phabricator.wikimedia.org/T196566#4277171 (10herron)
[20:58:38] <AaronSchulz>	 marxarelli: no CR yet
[20:59:12] <wikibugs>	 (03CR) 10MarcoAurelio: Some wikis bureacurats are able to grant non-grantable groups (037 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440002 (https://phabricator.wikimedia.org/T197026) (owner: 10Urbanecm)
[21:03:58] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp3046 is CRITICAL: CRITICAL: expiry mailbox lag is 2142389
[21:15:55] <wikibugs>	 (03CR) 10Jon Harald Søby: [C: 031] Fix wrong language in ur.wiktionary namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437974 (owner: 10Urbanecm)
[21:18:07] <wikibugs>	 10Operations, 10ops-eqiad: ms-be1036 in power off status, not responsive to power on commands - https://phabricator.wikimedia.org/T196873#4277197 (10Cmjohnson) A new system board is required.  I will coordinate with HP to get this taken care of ASAP.  Required part is  775400-001      System I/O board (motherb...
[21:31:56] <wikibugs>	 (03CR) 10Imarlier: "Puppet compiler run looks right: https://puppet-compiler.wmflabs.org/compiler02/11468/" [puppet] - 10https://gerrit.wikimedia.org/r/439648 (https://phabricator.wikimedia.org/T158837) (owner: 10Imarlier)
[21:32:39] <marlier>	 Anyone available to take a quick look and then merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/439648/ ?  Literally a one line change... :-)
[21:32:48] <icinga-wm>	 PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 503 (expecting: 404)
[21:33:57] <icinga-wm>	 RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy
[21:39:28] <wikibugs>	 (03PS2) 10BBlack: esams rebalance: remove 3043 from text [puppet] - 10https://gerrit.wikimedia.org/r/439967
[21:39:30] <wikibugs>	 (03PS4) 10BBlack: esams rebalance: add 3043 to upload [puppet] - 10https://gerrit.wikimedia.org/r/439936
[21:39:41] <bblack>	 !log cp3043 - starting process to move to reimage into cache_upload
[21:39:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:05] <logmsgbot>	 !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp3043.esams.wmnet
[21:40:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:29] <wikibugs>	 (03CR) 10BBlack: [C: 032] esams rebalance: remove 3043 from text [puppet] - 10https://gerrit.wikimedia.org/r/439967 (owner: 10BBlack)
[21:46:25] <bblack>	 !log cp3046 - restart varnish backend for mbox lag
[21:46:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:05] <wikibugs>	 (03CR) 10Alex Monk: mediawiki::web::beta_sites: convert wikibooks to vhost (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/439894 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto)
[21:54:28] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp3046 is OK: OK: expiry mailbox lag is 0
[22:00:28] <icinga-wm>	 PROBLEM - HHVM rendering on mw2139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:01:27] <icinga-wm>	 RECOVERY - HHVM rendering on mw2139 is OK: HTTP OK: HTTP/1.1 200 OK - 76173 bytes in 0.330 second response time
[22:01:37] <icinga-wm>	 PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received
[22:02:38] <icinga-wm>	 RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy
[22:05:48] <icinga-wm>	 PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test 
[22:05:48] <icinga-wm>	 from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 503 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 503 (expecting: 404)
[22:06:41] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4277285 (10Bstorm) Hmm.  I'm coming up dry on how to find the MAC address in all the things here.  labstore1008/9.mgmt.eqiad.wmnet...
[22:10:27] <icinga-wm>	 RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy
[22:19:14] <wikibugs>	 (03CR) 10BBlack: [C: 032] esams rebalance: add 3043 to upload [puppet] - 10https://gerrit.wikimedia.org/r/439936 (owner: 10BBlack)
[22:19:28] <wikibugs>	 10Operations, 10DBA, 10Gerrit, 10Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4277313 (10mmodell) @marostegui: I canceled some of the queued jobs which should have helped somewhat. The only thing I know to do beyond this is to stop replicating from gerrit.
[22:22:37] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4277317 (10Cmjohnson) Hrm, that's odd ....dns is setup and I setup idrac...I wondering if I forgot to connect the green mgmt cable....
[22:23:25] <twentyafterfour>	 !log phabricator: taking phd offline to relieve the load on the m3 database cluster
[22:23:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:58] <twentyafterfour>	 !log phabricator: I scheduled a 24 hour downtime in icinga for the phd service, to give me time to work on this issue. See T196840
[22:25:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:25:03] <stashbot>	 T196840: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840
[22:27:46] <wikibugs>	 (03PS3) 10Paladox: phabricator: Make phd.taskmasters configurable with hiera [puppet] - 10https://gerrit.wikimedia.org/r/439645
[22:33:07] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Power supply issue on maps1002 - https://phabricator.wikimedia.org/T196897#4277323 (10Cmjohnson)   Dear Christopher Johnson,  Hewlett Packard Enterprise Reference Number: 5330129651  STATUS: Customer Self Repair Part has been shipped  Part/s shipped: 754377-001 Part descr...
[22:33:25] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): Requesting deployment access for jforrester - https://phabricator.wikimedia.org/T196566#4277329 (10Jdforrester-WMF) Thank you! Confirmed that I can log into deploy1001 in production now.
[22:37:15] <tzatziki>	 !log (from yesterday) resetting passwords for compromised accounts (T197046)
[22:37:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:59:33] <logmsgbot>	 !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp3043.esams.wmnet
[22:59:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:04] <jouncebot>	 addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180612T2300).
[23:00:04] <jouncebot>	 Zoranzoki21: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:45] <bblack>	 !log cp3043 - done, reimaged, in live service for cache_upload
[23:00:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:01:01] <Zoranzoki21>	 I am here :)
[23:05:53] <tzatziki>	 !log resetting passwords for compromised accounts (T197046)
[23:05:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:31] <Zoranzoki21>	 Is anyone who can swat active right now?
[23:13:38] <wikibugs>	 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Patch-For-Review: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4277513 (10herron) >>! In T196989#4276270, @fgiunchedi wrote: > A bigger nail in the coffin for GET requests is also going to be enabling cach...
[23:15:04] <wikibugs>	 10Operations, 10DBA, 10Gerrit, 10Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4277521 (10mmodell) I'm deleting queued jobs in batches of 100,000. I've also reduced the number of phabricator workers to 5 (from 10) so overall there should be a reduction in...
[23:15:28] <icinga-wm>	 PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a
[23:15:28] <icinga-wm>	 ved
[23:16:37] <icinga-wm>	 RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy
[23:21:07] <icinga-wm>	 PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received
[23:22:07] <icinga-wm>	 RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy
[23:24:02] <James_F>	 Is anyone SWATing?
[23:24:35] <Zoranzoki21>	 I wait same
[23:24:44] <Zoranzoki21>	 James_F: Hi, I wait same. I no know it
[23:36:43] <Niharika>	 Zoranzoki21: I can SWAT now if you are still here.
[23:36:47] <Zoranzoki21>	 I am here
[23:36:50] <Zoranzoki21>	 Can you?
[23:36:53] <Niharika>	 Great. Yup.
[23:37:24] <Zoranzoki21>	 Niharika: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/436211/
[23:38:39] <Niharika>	 James_F: Have you got a minute to look over the patch? It seems sane but I'm not sure if people need to approve/have consensus before those domains can be added.
[23:39:07] <Zoranzoki21>	 Niharika: You talk for my or another patch?
[23:39:10] <James_F>	 Sure.
[23:39:42] <Niharika>	 Zoranzoki21: Yours. 
[23:40:13] <James_F>	 Niharika: Other than the whitespace issues, looks fine to me.
[23:40:53] <Zoranzoki21>	 James_F: Which whitespace issues?
[23:42:04] <wikibugs>	 (03PS7) 10Niharika29: Add sites to the wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436211 (https://phabricator.wikimedia.org/T195270) (owner: 10Zoranzoki21)
[23:42:11] <Niharika>	 Alright, fixed them. 
[23:42:24] <Niharika>	 Zoranzoki21: The comments were misaligned. 
[23:42:28] <Niharika>	 Thanks James_F.
[23:42:37] <wikibugs>	 (03CR) 10Niharika29: [C: 032] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436211 (https://phabricator.wikimedia.org/T195270) (owner: 10Zoranzoki21)
[23:43:11] <Zoranzoki21>	 Niharika: Oh it.. Ok, thank you for fix and deploying
[23:44:21] <wikibugs>	 (03Merged) 10jenkins-bot: Add sites to the wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436211 (https://phabricator.wikimedia.org/T195270) (owner: 10Zoranzoki21)
[23:47:48] <logmsgbot>	 !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add sites to the wgCopyUploadsDomains whitelist of Wikimedia Commons T195270, T195928 (duration: 00m 59s)
[23:47:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:53] <stashbot>	 T195270: Please add <http://journals.plos.org> and <https://pensoft.net> to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T195270
[23:47:53] <stashbot>	 T195928: Please add Chilean government websites to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T195928
[23:47:57] <wikibugs>	 (03CR) 10jenkins-bot: Add sites to the wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436211 (https://phabricator.wikimedia.org/T195270) (owner: 10Zoranzoki21)
[23:48:32] <Zoranzoki21>	 Thank you very much!
[23:49:02] <Niharika>	 Zoranzoki21: you're welcome. :)
[23:49:40] <Zoranzoki21>	 Good night