[00:00:04] <jouncebot>	 addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180118T0000).
[00:00:05] <jouncebot>	 Jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:54] <jdlrobson>	 \o
[00:01:42] <wikibugs>	 (PS1) Ppchelko: Remove wmgDebugJobQueueEventBus config parameter. [mediawiki-config] - https://gerrit.wikimedia.org/r/404888
[00:04:10] <wikibugs>	 (CR) Dzahn: [C: 2] Phabricator: Add translations library to phabricator profile [puppet] - https://gerrit.wikimedia.org/r/404887 (https://phabricator.wikimedia.org/T225) (owner: 20after4)
[00:04:47] <wikibugs>	 (CR) Dzahn: [C: 2] "cool project to translate phab with translatewiki.net and cute low ticket number" [puppet] - https://gerrit.wikimedia.org/r/404887 (https://phabricator.wikimedia.org/T225) (owner: 20after4)
[00:05:34] <twentyafterfour>	 Thanks mutante!
[00:05:46] <jdlrobson>	 twentyafterfour: are you able to swat?
[00:05:57] <twentyafterfour>	 jdlrobson: I can
[00:06:03] <jdlrobson>	 thank you :)
[00:06:23] <mutante>	 twentyafterfour: yw
[00:06:59] <twentyafterfour>	 jdlrobson: in the order you put them on the deploy calendar?
[00:07:11] <jdlrobson>	 twentyafterfour: yes please
[00:12:34] <twentyafterfour>	 waiting for jenkins...
[00:14:39] <jdlrobson>	 :)
[00:18:19] <wikibugs>	 (PS1) Chad: WIP: Sync security patches for MW from deployment to nightlies server [puppet] - https://gerrit.wikimedia.org/r/404892
[00:20:08] <wikibugs>	 (CR) Volans: [C: 1] "LGTM! Thanks for the fixes, and feel free to ignore my nitpicks comments inline." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/403574 (https://phabricator.wikimedia.org/T183999) (owner: Thcipriani)
[00:20:40] <wikibugs>	 Operations, ops-eqiad, netops: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#3908273 (RobH) p:Triage>Normal
[00:21:10] <wikibugs>	 Operations, ops-eqiad, netops: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#3908288 (RobH) The port mappings for onsite use on the existing mr1-eqiad (and likely to match on new device unless @ayounsi advises otherwise):  ge-0/0/0  Core: msw1-eqiad:ge-0/0/32 ge-0/0/1  Core: asw-a-eqi...
[00:23:04] <twentyafterfour>	 jdlrobson: ok I've sync'd with mwdebug1002 can you test there?
[00:23:39] <jdlrobson>	 twentyafterfour: on it..
[00:24:41] <jdlrobson>	 twentyafterfour: looks good to me
[00:25:07] <twentyafterfour>	 ok do we need to test the wmf.17 patch separately?
[00:25:33] <twentyafterfour>	 if not I'll sync them both out together
[00:26:16] <jdlrobson>	 sync together should be fine
[00:26:35] <wikibugs>	 (CR) 20after4: [C: 2] Use the correct Pashto Wikipedia wordmark on mobile site [mediawiki-config] - https://gerrit.wikimedia.org/r/404828 (https://phabricator.wikimedia.org/T184442) (owner: Jdlrobson)
[00:26:54] <twentyafterfour>	 I'll have you test the 3rd patch then I'll sync all 3
[00:29:14] <wikibugs>	 (CR) Dzahn: "http://puppet-compiler.wmflabs.org/9773/" [puppet] - https://gerrit.wikimedia.org/r/404892 (owner: Chad)
[00:29:30] <wikibugs>	 (Merged) jenkins-bot: Use the correct Pashto Wikipedia wordmark on mobile site [mediawiki-config] - https://gerrit.wikimedia.org/r/404828 (https://phabricator.wikimedia.org/T184442) (owner: Jdlrobson)
[00:29:43] <wikibugs>	 (CR) jenkins-bot: Use the correct Pashto Wikipedia wordmark on mobile site [mediawiki-config] - https://gerrit.wikimedia.org/r/404828 (https://phabricator.wikimedia.org/T184442) (owner: Jdlrobson)
[00:30:20] <wikibugs>	 (PS2) Chad: WIP: Sync security patches for MW from deployment to nightlies server [puppet] - https://gerrit.wikimedia.org/r/404892
[00:31:00] <twentyafterfour>	 jdlrobson: ok the pashto wordmark patch is on mwdebug1002
[00:31:07] <jdlrobson>	 twentyafterfour: on it.
[00:32:02] <jdlrobson>	 twentyafterfour: gimme 5 mins to confirm with a designer :)
[00:32:44] <twentyafterfour>	 ok
[00:33:49] <jdlrobson>	 twentyafterfour good to go! https://usercontent.irccloud-cdn.com/file/V7oVzShb/image.png
[00:33:57] <jdlrobson>	 ahh gif to png fail
[00:34:09] <jdlrobson>	 https://media1.giphy.com/media/xUOxf4Qm1KX6RuUuxa/giphy-downsized.gif
[00:35:32] <logmsgbot>	 !log twentyafterfour@tin Started scap: Evening SWAT
[00:35:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:27] <jdlrobson>	 thx twentyafterfour :)
[00:40:06] <wikibugs>	 Operations, MediaWiki-JobQueue, Performance-Team, Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3908314 (Tgr) ``` tgr@deployment-mediawiki04:~$ telnet deployment-redis01.deployment-prep.eqiad.wmflabs 6379 Trying 10.68.16.177... telnet: Un...
[00:41:15] <icinga-wm>	 PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 3 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Service[nagios-nrpe-server],Exec[ip addr add 2620:0:860:102:10:192:16:139/64 dev eth0],Exec[absent_ensure_members]
[00:42:55] <wikibugs>	 (CR) Dzahn: [C: 1] "looks good. rsync on tin, cron on releases1001, none releases2001" [puppet] - https://gerrit.wikimedia.org/r/404892 (owner: Chad)
[00:45:02] <twentyafterfour>	 jdlrobson: you're welcome
[00:45:22] <jdlrobson>	 twentyafterfour: scap is still running right?
[00:47:47] <twentyafterfour>	 right
[00:48:08] <twentyafterfour>	 it's on canaries now
[00:49:32] <twentyafterfour>	 jdlrobson: now it's syncing proxies so it's almost there
[00:55:05] <twentyafterfour>	 ok it's done syncing, just doing cdb rebuild now
[00:58:05] <icinga-wm>	 PROBLEM - configured eth on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:58:24] <icinga-wm>	 PROBLEM - Disk space on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:58:31] <jdlrobson>	 twentyafterfour: awesome.. seeing the logo etc now
[00:58:34] <icinga-wm>	 PROBLEM - Check systemd state on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:59:04] <icinga-wm>	 PROBLEM - dhclient process on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:59:14] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:59:35] <icinga-wm>	 PROBLEM - Check size of conntrack table on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:59:44] <icinga-wm>	 PROBLEM - DPKG on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[01:00:01] <logmsgbot>	 !log twentyafterfour@tin Finished scap: Evening SWAT (duration: 24m 29s)
[01:00:04] <jouncebot>	 twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180118T0100).
[01:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[01:00:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:56] <twentyafterfour>	 jdlrobson: sweet, that concludes this Evening SWAT.
[01:01:15] <icinga-wm>	 PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[01:02:00] <twentyafterfour>	 !log Evening SWAT completed. Starting phabricator deployment of #phabricator-2018-07-17 [release/2017-01-17/1]
[01:02:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:44] <icinga-wm>	 PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer
[01:02:45] <icinga-wm>	 RECOVERY - DPKG on pybal-test2001 is OK: All packages OK
[01:03:04] <icinga-wm>	 RECOVERY - dhclient process on pybal-test2001 is OK: PROCS OK: 0 processes with command name dhclient
[01:03:05] <icinga-wm>	 RECOVERY - configured eth on pybal-test2001 is OK: OK - interfaces up
[01:03:14] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on pybal-test2001 is OK: OK ferm input default policy is set
[01:03:25] <icinga-wm>	 RECOVERY - Disk space on pybal-test2001 is OK: DISK OK
[01:03:34] <icinga-wm>	 RECOVERY - Check systemd state on pybal-test2001 is OK: OK - running: The system is fully operational
[01:03:35] <icinga-wm>	 RECOVERY - Check size of conntrack table on pybal-test2001 is OK: OK: nf_conntrack is 0 % full
[01:03:44] <icinga-wm>	 RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[01:04:09] <twentyafterfour>	 Phabricator will be offline for a short time.
[01:08:33] <twentyafterfour>	 !log phabricator deployment finished without incident.
[01:08:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:15] <icinga-wm>	 RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:13:00] <wikibugs>	 Operations, MediaWiki-JobQueue, Performance-Team, Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3908373 (Tgr) So this is partially my fault, sorry :/ I started redis with `sudo service redis-server start` but the redis service that is pro...
[01:26:04] <wikibugs>	 Operations, MediaWiki-JobQueue, Performance-Team, Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3908393 (Tgr) p:High>Normal a:Tgr >>! In T185055#3908373, @Tgr wrote: > nutcracker is still dead :(  It just needed a restart, so...
[01:31:18] <wikibugs>	 Operations, MediaWiki-JobQueue, Performance-Team, Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3908399 (Tgr) Also I have zero clue why `redis-cli -s /var/run/nutcracker/redis_eqiad.sock -a <correct password>` and `redis-cli -s /var/run/n...
[02:27:50] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.16) (duration: 07m 18s)
[02:28:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:44] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 863.40 seconds
[03:56:44] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 196.05 seconds
[04:07:04] <icinga-wm>	 PROBLEM - Disk space on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:07:15] <icinga-wm>	 PROBLEM - Check size of conntrack table on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:07:24] <icinga-wm>	 PROBLEM - Check systemd state on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:07:34] <icinga-wm>	 PROBLEM - DPKG on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:07:44] <icinga-wm>	 PROBLEM - dhclient process on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:07:45] <icinga-wm>	 PROBLEM - configured eth on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:07:54] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:08:15] <icinga-wm>	 PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:10:14] <icinga-wm>	 PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer
[04:14:24] <icinga-wm>	 RECOVERY - Check systemd state on pybal-test2001 is OK: OK - running: The system is fully operational
[04:14:34] <icinga-wm>	 RECOVERY - DPKG on pybal-test2001 is OK: All packages OK
[04:14:44] <icinga-wm>	 RECOVERY - dhclient process on pybal-test2001 is OK: PROCS OK: 0 processes with command name dhclient
[04:14:54] <icinga-wm>	 RECOVERY - configured eth on pybal-test2001 is OK: OK - interfaces up
[04:14:54] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on pybal-test2001 is OK: OK ferm input default policy is set
[04:15:04] <icinga-wm>	 RECOVERY - Disk space on pybal-test2001 is OK: DISK OK
[04:15:15] <icinga-wm>	 RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[04:15:15] <icinga-wm>	 RECOVERY - Check size of conntrack table on pybal-test2001 is OK: OK: nf_conntrack is 0 % full
[04:18:15] <icinga-wm>	 RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 39 minutes ago with 0 failures
[06:03:55] <wikibugs>	 (PS1) Groovier1: Adding config for WikimediaEvents module for logging behaviour data [mediawiki-config] - https://gerrit.wikimedia.org/r/404910
[06:07:54] <wikibugs>	 (PS1) Rxy: Change autoconfirmed settings and Enable flood group at zhwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182)
[06:12:52] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/404912
[06:12:56] <wikibugs>	 (PS2) Marostegui: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/404912
[06:15:30] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/404912 (owner: Marostegui)
[06:16:57] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/404912 (owner: Marostegui)
[06:17:07] <wikibugs>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/404912 (owner: Marostegui)
[06:18:42] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1099:3318 - T174569 (duration: 01m 13s)
[06:18:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:58] <stashbot>	 T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:21:14] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/404913 (https://phabricator.wikimedia.org/T174569)
[06:23:34] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/404913 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:24:59] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/404913 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:27:00] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/404913 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:27:02] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1087 - T174569 (duration: 01m 12s)
[06:27:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:16] <stashbot>	 T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:27:27] <marostegui>	 !log Deploy schema change on s8 db1087 (sanitarium master) with replication (this will generate lag on labs servers) - T174569
[06:27:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:18] <icinga-wm>	 RECOVERY - puppet last run on labtestnet2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:36:04] <wikibugs>	 (PS2) Rxy: Change autoconfirmed settings and Enable flood group at zhwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182)
[06:52:26] <wikibugs>	 (CR) Urbanecm: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182) (owner: Rxy)
[06:53:46] <wikibugs>	 (CR) jerkins-bot: [V: -1] Change autoconfirmed settings and Enable flood group at zhwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182) (owner: Rxy)
[06:59:30] <wikibugs>	 (PS3) Rxy: Change autoconfirmed settings and Enable flood group at zhwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182)
[07:01:46] <wikibugs>	 (CR) Rxy: "> Uploaded patch set 3." [mediawiki-config] - https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182) (owner: Rxy)
[07:05:38] <wikibugs>	 (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182) (owner: Rxy)
[07:07:47] <wikibugs>	 (CR) Urbanecm: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182) (owner: Rxy)
[07:10:28] <icinga-wm>	 PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:40:19] <icinga-wm>	 RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:44:45] <wikibugs>	 (PS3) Giuseppe Lavagetto: Add Python 3 support [software/conftool] - https://gerrit.wikimedia.org/r/387544 (owner: Volans)
[08:02:48] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[08:08:49] <wikibugs>	 (PS1) Muehlenhoff: Also add samwalton9, samtar to absented group [puppet] - https://gerrit.wikimedia.org/r/404933
[08:09:36] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Also add samwalton9, samtar to absented group [puppet] - https://gerrit.wikimedia.org/r/404933 (owner: Muehlenhoff)
[08:16:49] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[08:17:49] <wikibugs>	 (CR) Ema: [C: 2] Add unit test cases for Server [debs/pybal] - https://gerrit.wikimedia.org/r/404704 (owner: Mark Bergsma)
[08:18:48] <wikibugs>	 (CR) Ema: [C: 2] Separate out coordinator.Server into its own module [debs/pybal] - https://gerrit.wikimedia.org/r/404713 (owner: Mark Bergsma)
[08:19:26] <wikibugs>	 (CR) Ema: [C: 2] Expand test coverage of server.py [debs/pybal] - https://gerrit.wikimedia.org/r/404762 (owner: Mark Bergsma)
[08:27:28] <wikibugs>	 (CR) Gergő Tisza: Adding config for WikimediaEvents module for logging behaviour data (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/404910 (owner: Groovier1)
[08:27:35] <wikibugs>	 (CR) Gergő Tisza: [C: -1] Adding config for WikimediaEvents module for logging behaviour data [mediawiki-config] - https://gerrit.wikimedia.org/r/404910 (owner: Groovier1)
[08:30:20] <moritzm>	 !log reboot iron for kernel security update
[08:30:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:21] <godog>	 !log bootstrap cassandra-c on restbase1013
[08:37:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:17] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/404936 (https://phabricator.wikimedia.org/T162807)
[08:44:34] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/404936 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[08:46:07] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/404936 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[08:46:27] <wikibugs>	 (CR) Ema: "A few comments, looks good in general." (4 comments) [debs/pybal] - https://gerrit.wikimedia.org/r/393097 (https://phabricator.wikimedia.org/T165764) (owner: Mark Bergsma)
[08:46:49] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/404936 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[08:51:53] <wikibugs>	 (CR) Filippo Giunchedi: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444) (owner: Herron)
[08:52:32] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1067  - T162807 (duration: 01m 13s)
[08:52:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:46] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[08:59:05] <wikibugs>	 Operations, DBA, Goal, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3908702 (jcrespo)
[09:02:09] <wikibugs>	 (PS3) Filippo Giunchedi: restbase: reprovision restbase201[012] [puppet] - https://gerrit.wikimedia.org/r/404652 (https://phabricator.wikimedia.org/T184100)
[09:03:11] <wikibugs>	 (CR) Filippo Giunchedi: [C: 2] restbase: reprovision restbase201[012] [puppet] - https://gerrit.wikimedia.org/r/404652 (https://phabricator.wikimedia.org/T184100) (owner: Filippo Giunchedi)
[09:05:07] <wikibugs>	 (PS1) Ema: cache_upload: use resp.reason in vtc test cases [puppet] - https://gerrit.wikimedia.org/r/404940 (https://phabricator.wikimedia.org/T180433)
[09:07:11] <marostegui>	 !log Stop replication in sync db1089 db1067 - T162807
[09:07:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:23] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[09:07:35] <wikibugs>	 (CR) Ema: [C: 2] cache_upload: use resp.reason in vtc test cases [puppet] - https://gerrit.wikimedia.org/r/404940 (https://phabricator.wikimedia.org/T180433) (owner: Ema)
[09:09:43] <wikibugs>	 (PS1) Lokal Profil: Drop the medlem user group and editallpages user right [mediawiki-config] - https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981)
[09:10:01] <akosiaris>	 !log reboot alcyone pollux sca2004 poolcounter2002 serpens for PCID/INVPCID CPU feature enabling
[09:10:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:03] <wikibugs>	 (CR) jerkins-bot: [V: -1] Drop the medlem user group and editallpages user right [mediawiki-config] - https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981) (owner: Lokal Profil)
[09:12:05] <wikibugs>	 (PS2) Lokal Profil: Drop the medlem user group and editallpages user right [mediawiki-config] - https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981)
[09:14:54] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Remove db2034 [mediawiki-config] - https://gerrit.wikimedia.org/r/404943 (https://phabricator.wikimedia.org/T184888)
[09:15:27] <wikibugs>	 (PS2) Marostegui: db-codfw.php: Remove db2034 [mediawiki-config] - https://gerrit.wikimedia.org/r/404943 (https://phabricator.wikimedia.org/T184888)
[09:19:09] <wikibugs>	 (PS1) Ema: cache_upload: upgrade cp3034 to varnish 5 [puppet] - https://gerrit.wikimedia.org/r/404944 (https://phabricator.wikimedia.org/T180433)
[09:20:35] <akosiaris>	 !log reboot oresrdb2001 for PCID/INVPCID CPU feature enabling
[09:20:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:27] <elukey>	 !log reboot druid1001 for kernel upgrades
[09:21:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:02] <wikibugs>	 (CR) Ema: [C: 2] cache_upload: upgrade cp3034 to varnish 5 [puppet] - https://gerrit.wikimedia.org/r/404944 (https://phabricator.wikimedia.org/T180433) (owner: Ema)
[09:25:48] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp3034 is CRITICAL: connect to address 10.20.0.169 and port 3122: Connection refused
[09:26:00] <ema>	 that's me, the host is depooled ^
[09:26:51] <jynus>	 !log reimage es2003 to stretch
[09:27:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:48] <marostegui>	 !log !log Stop replication in sync db1089 and db2048 (codfw master) - T162807
[09:27:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:59] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[09:28:55] <icinga-wm>	 PROBLEM - puppet last run on restbase2012 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 14 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[cassandra],Package[cassandra/metrics-collector],Package[restbase/deploy]
[09:31:47] <wikibugs>	 Operations, Scap: scap sudo violation on first puppet run - https://phabricator.wikimedia.org/T185189#3908798 (fgiunchedi)
[09:32:25] <icinga-wm>	 PROBLEM - Restbase root url on restbase2012 is CRITICAL: connect to address 10.192.48.67 and port 7231: Connection refused
[09:32:25] <icinga-wm>	 PROBLEM - cassandra-a service on restbase2010 is CRITICAL: NRPE: Command check_cassandra-a-state not defined
[09:32:25] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.32.152:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.152 and port 9042: Connection refused
[09:32:56] <icinga-wm>	 PROBLEM - puppet last run on restbase2011 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 16 minutes ago with 4 failures. Failed resources (up to 3 shown): Package[restbase/deploy],Service[cassandra-a],Service[cassandra-b],Service[cassandra-c]
[09:34:06] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.16.187:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.187 and port 9042: Connection refused
[09:34:06] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.48.68:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.68 and port 9042: Connection refused
[09:34:06] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.32.152:7001 on restbase2011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[09:34:46] <icinga-wm>	 RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp3034 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.168 second response time
[09:35:29] <wikibugs>	 (PS1) Filippo Giunchedi: scap: require sudo rules to be in place before deploy [puppet] - https://gerrit.wikimedia.org/r/404945 (https://phabricator.wikimedia.org/T185189)
[09:35:46] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.16.187:7001 on restbase2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[09:35:46] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.48.68:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[09:35:47] <icinga-wm>	 PROBLEM - cassandra-a service on restbase2011 is CRITICAL: NRPE: Command check_cassandra-a-state not defined
[09:36:15] <wikibugs>	 (PS3) Marostegui: db-codfw.php: Remove db2034 from s1 [mediawiki-config] - https://gerrit.wikimedia.org/r/404943 (https://phabricator.wikimedia.org/T184888)
[09:37:35] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.32.153:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.153 and port 9042: Connection refused
[09:37:35] <icinga-wm>	 PROBLEM - cassandra-b service on restbase2010 is CRITICAL: NRPE: Command check_cassandra-b-state not defined
[09:37:35] <icinga-wm>	 PROBLEM - cassandra-a service on restbase2012 is CRITICAL: NRPE: Command check_cassandra-a-state not defined
[09:38:34] <elukey>	 !log reboot thorium (analytics webserver) for security upgrade - This maintenance will cause temporary unavailability of the Analytics websites
[09:38:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:15] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.188 and port 9042: Connection refused
[09:39:15] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused
[09:39:16] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.32.153:7001 on restbase2011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[09:40:56] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[09:40:56] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.192.16.188:7001 on restbase2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[09:40:56] <icinga-wm>	 PROBLEM - cassandra-b service on restbase2011 is CRITICAL: NRPE: Command check_cassandra-b-state not defined
[09:40:58] <wikibugs>	 (PS2) Volans: wmf-auto-reimage: fix host validation logic [puppet] - https://gerrit.wikimedia.org/r/404439 (https://phabricator.wikimedia.org/T182702)
[09:41:06] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - https://gerrit.wikimedia.org/r/404946
[09:41:36] <elukey>	 mmm is this expired downtime godog --^ ?
[09:41:53] <godog>	 ugh, yeah thanks elukey 
[09:42:14] <elukey>	 ah super, it was scary :D
[09:42:36] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.32.154:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.154 and port 9042: Connection refused
[09:42:37] <icinga-wm>	 PROBLEM - cassandra-c service on restbase2010 is CRITICAL: NRPE: Command check_cassandra-c-state not defined
[09:42:37] <icinga-wm>	 PROBLEM - cassandra-b service on restbase2012 is CRITICAL: NRPE: Command check_cassandra-b-state not defined
[09:43:13] <ema>	 !log cache_upload: repooled cp3034 running varnish 5 
[09:43:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:02] <wikibugs>	 (CR) Jcrespo: [C: 1] "this is ok, assuming no_raise is set on --no-validate" [puppet] - https://gerrit.wikimedia.org/r/404439 (https://phabricator.wikimedia.org/T182702) (owner: Volans)
[09:45:24] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - https://gerrit.wikimedia.org/r/404946 (owner: Marostegui)
[09:46:31] <wikibugs>	 (CR) Volans: [C: 2] "yes, indeed. Thanks for the review." [puppet] - https://gerrit.wikimedia.org/r/404439 (https://phabricator.wikimedia.org/T182702) (owner: Volans)
[09:48:19] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - https://gerrit.wikimedia.org/r/404946 (owner: Marostegui)
[09:48:30] <wikibugs>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - https://gerrit.wikimedia.org/r/404946 (owner: Marostegui)
[09:49:56] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1067  - T162807 (duration: 01m 12s)
[09:50:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:09] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[09:53:26] <icinga-wm>	 RECOVERY - Restbase root url on restbase2012 is OK: HTTP OK: HTTP/1.1 200 - 15785 bytes in 0.084 second response time
[09:58:21] <akosiaris>	 !log reboot etcd1006 for PCID, INVPCID feature enabling (INVPCID not supported on current hardware, but still enabling it cluster wide)
[09:58:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:53] <wikibugs>	 10Operations, 10Analytics-Kanban, 10Patch-For-Review: Puppet admin module should support adding system users to managed groups - https://phabricator.wikimedia.org/T174465#3562875 (10MoritzMuehlenhoff) This is solely for T174110 or are we anticipating other use cases?   The high level implementation idea seem...
[10:07:39] <moritzm>	 !log rebooting rdb1002/rdb1004/rdb1006/rdb1008 for kernel security update
[10:07:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:52] <logmsgbot>	 !log mobrovac@tin Started deploy [restbase/deploy@04e7cdb]: Use stable package names, normalise cache-control headers, update top definition - T184199 T184833 T184541
[10:08:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:07] <stashbot>	 T184833: Inconsistent behavior when fetching redirected pages with Cache-Control header - https://phabricator.wikimedia.org/T184833
[10:08:07] <stashbot>	 T184199: Discontinue the Cassandra, Sqlite and Spec -ng packages - https://phabricator.wikimedia.org/T184199
[10:08:07] <stashbot>	 T184541: Update AQS pageview-top definition - https://phabricator.wikimedia.org/T184541
[10:10:20] <logmsgbot>	 !log mobrovac@tin Finished deploy [restbase/deploy@04e7cdb]: Use stable package names, normalise cache-control headers, update top definition - T184199 T184833 T184541 (duration: 02m 29s)
[10:10:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:27] <icinga-wm>	 PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[bump nf_conntrack hash table size]
[10:12:53] <logmsgbot>	 !log mobrovac@tin Started deploy [restbase/deploy@5c353f7]: Use stable package names, normalise cache-control headers, update top definition, take #2 - T184199 T184833 T184541
[10:13:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:50] <mark>	 ]]
[10:17:11] <wikibugs>	 10Operations, 10User-Elukey: Sporadic logrotate issue for stretch mediawiki appservers - https://phabricator.wikimedia.org/T185195#3908938 (10elukey) p:05Triage>03Normal
[10:17:29] <wikibugs>	 10Operations, 10User-Elukey: Sporadic logrotate issue for stretch mediawiki appservers - https://phabricator.wikimedia.org/T185195#3908950 (10elukey)
[10:17:32] <wikibugs>	 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#3908951 (10elukey)
[10:19:06] <wikibugs>	 10Operations, 10Analytics-Kanban, 10monitoring, 10netops, 10User-Elukey: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3908967 (10elukey) @faidon whenever you have time do you mind to explain a bit what data is currently pushed to the netflow...
[10:20:09] <icinga-wm>	 PROBLEM - Check size of conntrack table on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:20:18] <icinga-wm>	 PROBLEM - Check systemd state on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:20:19] <icinga-wm>	 PROBLEM - dhclient process on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:20:19] <icinga-wm>	 PROBLEM - DPKG on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:20:28] <icinga-wm>	 PROBLEM - Disk space on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:20:38] <icinga-wm>	 PROBLEM - configured eth on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:20:39] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:21:28] <icinga-wm>	 PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer
[10:22:09] <icinga-wm>	 RECOVERY - Check size of conntrack table on pybal-test2001 is OK: OK: nf_conntrack is 0 % full
[10:22:18] <icinga-wm>	 RECOVERY - Check systemd state on pybal-test2001 is OK: OK - running: The system is fully operational
[10:22:19] <icinga-wm>	 RECOVERY - dhclient process on pybal-test2001 is OK: PROCS OK: 0 processes with command name dhclient
[10:22:19] <icinga-wm>	 RECOVERY - DPKG on pybal-test2001 is OK: All packages OK
[10:22:29] <icinga-wm>	 RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[10:22:29] <icinga-wm>	 RECOVERY - Disk space on pybal-test2001 is OK: DISK OK
[10:22:38] <icinga-wm>	 RECOVERY - configured eth on pybal-test2001 is OK: OK - interfaces up
[10:22:39] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on pybal-test2001 is OK: OK ferm input default policy is set
[10:25:11] <logmsgbot>	 !log mobrovac@tin Finished deploy [restbase/deploy@5c353f7]: Use stable package names, normalise cache-control headers, update top definition, take #2 - T184199 T184833 T184541 (duration: 12m 18s)
[10:25:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:25] <stashbot>	 T184833: Inconsistent behavior when fetching redirected pages with Cache-Control header - https://phabricator.wikimedia.org/T184833
[10:25:25] <stashbot>	 T184199: Discontinue the Cassandra, Sqlite and Spec -ng packages - https://phabricator.wikimedia.org/T184199
[10:25:25] <stashbot>	 T184541: Update AQS pageview-top definition - https://phabricator.wikimedia.org/T184541
[10:25:58] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-codfw.php: Remove db2034 from s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404943 (https://phabricator.wikimedia.org/T184888) (owner: 10Marostegui)
[10:27:33] <wikibugs>	 (03Merged) 10jenkins-bot: db-codfw.php: Remove db2034 from s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404943 (https://phabricator.wikimedia.org/T184888) (owner: 10Marostegui)
[10:27:36] <wikibugs>	 10Operations, 10Analytics-Kanban, 10Patch-For-Review: Puppet admin module should support adding system users to managed groups - https://phabricator.wikimedia.org/T174465#3562875 (10akosiaris) Ι 'll echo Moritz on this one. It does look like adding system users to the admin module adds some complexity and do...
[10:27:45] <wikibugs>	 (03CR) 10jenkins-bot: db-codfw.php: Remove db2034 from s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404943 (https://phabricator.wikimedia.org/T184888) (owner: 10Marostegui)
[10:29:27] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db2034 from s1 as it will be in x1 - T184888 (duration: 01m 12s)
[10:29:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:40] <stashbot>	 T184888: Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master)) - https://phabricator.wikimedia.org/T184888
[10:30:50] <akosiaris>	 !log reboot etherpad1001 for PCID, INVPCID feature enabling (INVPCID not supported on current hardware, but still enabling it cluster wide)
[10:31:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:09] <icinga-wm>	 PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:32:09] <icinga-wm>	 PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:32:09] <icinga-wm>	 PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:32:09] <icinga-wm>	 PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:32:09] <icinga-wm>	 PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:32:28] <icinga-wm>	 PROBLEM - puppet last run on wtp1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:32:39] <icinga-wm>	 PROBLEM - puppet last run on aqs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:33:09] <icinga-wm>	 PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:33:38] <icinga-wm>	 PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:33:38] <icinga-wm>	 PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:34:19] <icinga-wm>	 PROBLEM - puppet last run on db1097 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:34:19] <icinga-wm>	 PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:34:19] <icinga-wm>	 PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:34:38] <icinga-wm>	 PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:34:38] <icinga-wm>	 PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:35:43] <Amir1>	 !log ladsgroup@terbium:/srv/mediawiki/php-1.31.0-wmf.17$ mwscript extensions/WikibaseQualityConstraints/maintenance/ImportConstraintStatements.php  --wiki wikidatawiki (T184720)
[10:35:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:56] <stashbot>	 T184720: Re-import constraints from statements - https://phabricator.wikimedia.org/T184720
[10:36:53] <wikibugs>	 (03PS1) 10Elukey: Allow to set the JAVA_HOME env variable in hadoop/hive/oozie [puppet] - 10https://gerrit.wikimedia.org/r/404954 (https://phabricator.wikimedia.org/T166248)
[10:38:24] <wikibugs>	 10Operations, 10Performance-Team (Radar): Add profiling for Varnish and VCL - https://phabricator.wikimedia.org/T175710#3909125 (10Krinkle)
[10:40:21] <wikibugs>	 10Operations, 10Page Content Service, 10RESTBase, 10Reading-Infrastructure-Team-Backlog, and 3 others: Inconsistent behavior when fetching redirected pages with Cache-Control header - https://phabricator.wikimedia.org/T184833#3909134 (10mobrovac) 05Open>03Resolved a:03Pchelolo It seems @Pchelolo's no...
[10:41:37] <wikibugs>	 (03PS2) 10Elukey: Allow to set the JAVA_HOME env variable in hadoop/hive/oozie [puppet] - 10https://gerrit.wikimedia.org/r/404954 (https://phabricator.wikimedia.org/T166248)
[10:42:28] <icinga-wm>	 RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[10:45:19] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3909156 (10Krinkle)
[10:45:49] <wikibugs>	 (03PS1) 10Marostegui: install_server: Allow db2034 reinstall as stretch [puppet] - 10https://gerrit.wikimedia.org/r/404955 (https://phabricator.wikimedia.org/T184888)
[10:49:58] <wikibugs>	 (03PS2) 10Marostegui: install_server: Allow db2034 reinstall as stretch [puppet] - 10https://gerrit.wikimedia.org/r/404955 (https://phabricator.wikimedia.org/T184888)
[10:57:27] <akosiaris>	 !log reboot actinium.wikimedia.org aluminium.wikimedia.org argon.eqiad.wmnet boron.eqiad.wmnet bromine.eqiad.wmnet darmstadtium.eqiad.wmnet dbmonitor1001.wikimedia.org dubnium.wikimedia.org dysprosium.wikimedia.org etcd1001.eqiad.wmnet etcd1004.eqiad.wmnet fermium.wikimedia.org hassium.eqiad.wmnet kubestagetcd1001.eqiad.wmnet logstash1007.eqiad.wmnet meitnerium.wikimedia.org mendelevium.eqiad.wmnet mwdebug1002.eqiad.wmnet
[10:57:27] <akosiaris>	 mx1001.wikimedia.org neon.eqiad.wmnet netmon1003.wikimedia.org planet1001.eqiad.wmnet poolcounter1001.eqiad.wmnet releases1001.eqiad.wmnet roentgenium.eqiad.wmnet rutherfordium.eqiad.wmnet sca1003.eqiad.wmnet ununpentium.wikimedia.org for PCID, INVPCID feature enabling (INVPCID not supported on current hardware, but still enabling it cluster wide)
[10:57:31] <akosiaris>	 heads up ^
[10:57:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:08] <icinga-wm>	 RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[10:58:38] <icinga-wm>	 RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[10:59:18] <icinga-wm>	 RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:59:19] <icinga-wm>	 RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[10:59:38] <icinga-wm>	 RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:59:38] <icinga-wm>	 RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[11:01:09] <icinga-wm>	 RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:02:08] <icinga-wm>	 RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:02:08] <icinga-wm>	 RECOVERY - puppet last run on db1073 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:02:09] <icinga-wm>	 RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:02:09] <icinga-wm>	 RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:02:28] <icinga-wm>	 RECOVERY - puppet last run on wtp1026 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:02:39] <icinga-wm>	 RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:02:56] <wikibugs>	 10Operations, 10User-Elukey: Sporadic logrotate issue for stretch mediawiki appservers - https://phabricator.wikimedia.org/T185195#3909201 (10Volans) FYI https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=881725
[11:03:38] <icinga-wm>	 RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:04:18] <icinga-wm>	 RECOVERY - puppet last run on db1097 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:05:18] <icinga-wm>	 PROBLEM - etc request latencies on argon is CRITICAL: CRITICAL - etcd_request_latencies is 58689 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[11:06:19] <icinga-wm>	 RECOVERY - etc request latencies on argon is OK: OK - etcd_request_latencies is 1964 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[11:06:27] <volans>	 !log disabled puppet on tegmen to test impact on puppetdb - T170740
[11:06:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:38] <stashbot>	 T170740: PuppetDB misbehaving on 2017-07-15 - https://phabricator.wikimedia.org/T170740
[11:22:07] <wikibugs>	 (03CR) 10Mobrovac: [C: 031] "@Ppchelko, could you SWAT this today? Even though we could get it in together with the next migration." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404888 (owner: 10Ppchelko)
[11:34:27] <akosiaris>	 !log reboot logstash1008 etcd1002  kubestagetcd1002.eqiad.wmnet for PCID, INVPCID feature enabling (INVPCID not supported on current hardware, but still enabling it cluster wide)
[11:34:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:06] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: apt: apt-upgrades: dont fail if new packages are being installed [puppet] - 10https://gerrit.wikimedia.org/r/404963 (https://phabricator.wikimedia.org/T178717)
[11:38:50] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: apt-upgrades: dont fail if new packages are being installed [puppet] - 10https://gerrit.wikimedia.org/r/404963 (https://phabricator.wikimedia.org/T178717) (owner: 10Arturo Borrero Gonzalez)
[11:46:28] <yannf>	 hi
[11:46:38] <yannf>	 https://fr.wikisource.org/wiki/Le_Roi_d%E2%80%99Yvetot
[11:46:58] <yannf>	 any idea what's going on here? ^
[11:47:12] <yannf>	 [WmCIHQpAMFAAAES6KqkAAACW] 2018-01-18 11:42:24: Erreur fatale de type « InvalidArgumentException »
[11:47:28] <yannf>	 it's the first time I see such a message
[11:50:46] <yannf>	 https://phabricator.wikimedia.org/T185204
[11:51:55] <_joe_>	 yannf: taking a look
[11:52:12] <volans>	 php-1.31.0-wmf.17/includes/filebackend/FileBackendGroup.php:183
[11:52:59] <_joe_>	 yeah looks like a MediaWiki failure
[11:53:29] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review, 10User-Elukey: Configure puppetdb to export metrics via Prometheus JMX Agent - https://phabricator.wikimedia.org/T184796#3909310 (10elukey) 05Open>03Resolved Closing task since https://grafana.wikimedia.org/dashboard/db/puppetdb is almost a replica of the...
[11:54:07] <wikibugs>	 10Operations, 10Puppet, 10Patch-For-Review: PuppetDB misbehaving on 2017-07-15 - https://phabricator.wikimedia.org/T170740#3909312 (10elukey) The puppetdb grafana dashboard (and its related monitoring config for nitrogen/nihal) were added in https://phabricator.wikimedia.org/T184796
[11:59:24] <wikibugs>	 (03PS1) 10Muehlenhoff: Depool poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404967
[12:01:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Depool poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404967 (owner: 10Muehlenhoff)
[12:19:30] <wikibugs>	 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3909349 (10Deskana) I don't want deployment of TemplateStyles to be a moving target—as it has been for many months—so I'm going to stick...
[12:21:01] <wikibugs>	 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3909354 (10jcrespo)
[12:29:19] <icinga-wm>	 PROBLEM - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[12:30:42] <wikibugs>	 (03PS5) 10Jcrespo: compare.py: Implement progress reporting, more than 2 servers comp. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/404647
[12:36:20] <akosiaris>	 !log reboot chlorine.eqiad.wmnet etcd1003.eqiad.wmnet etcd1005.eqiad.wmnet fermium.wikimedia.org install1002.wikimedia.org krypton.eqiad.wmnet kubestagetcd1003.eqiad.wmnet logstash1009.eqiad.wmnet mwdebug1001.eqiad.wmnet sca1004.eqiad.wmnet  for PCID, INVPCID feature enabling (INVPCID not supported on current hardware, but still enabling it cluster wide)
[12:36:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:58] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on install1002 is CRITICAL: Return code of 255 is out of bounds
[12:39:08] <icinga-wm>	 PROBLEM - Squid on install1002 is CRITICAL: connect to address 208.80.154.22 and port 8080: Connection refused
[12:39:28] <icinga-wm>	 PROBLEM - Check systemd state on install1002 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached.
[12:39:59] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on install1002 is OK: OK ferm input default policy is set
[12:40:08] <icinga-wm>	 RECOVERY - Squid on install1002 is OK: TCP OK - 0.000 second response time on 208.80.154.22 port 8080
[12:40:40] <elukey>	 !log set piwik in readonly mode and stopped mysql on bohrium (prep step for reboot)
[12:40:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:28] <icinga-wm>	 RECOVERY - Check systemd state on install1002 is OK: OK - running: The system is fully operational
[12:43:03] <akosiaris>	 !log disable puppet across the fleet for nitrogen (puppetdb) reboot
[12:43:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:15] * akosiaris <3 cumin
[12:43:52] <volans>	 yw :)
[12:43:58] <elukey>	 !log bohrium rebooted for kernel upgrades
[12:44:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:32] <akosiaris>	 !log reboot seaborgium for PCID, INVPCID feature enabling (INVPCID not supported on current hardware, but still enabling it cluster wide)
[12:45:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:41] <wikibugs>	 (03CR) 10Marostegui: [C: 032] install_server: Allow db2034 reinstall as stretch [puppet] - 10https://gerrit.wikimedia.org/r/404955 (https://phabricator.wikimedia.org/T184888) (owner: 10Marostegui)
[12:50:46] <wikibugs>	 (03PS3) 10Marostegui: install_server: Allow db2034 reinstall as stretch [puppet] - 10https://gerrit.wikimedia.org/r/404955 (https://phabricator.wikimedia.org/T184888)
[12:56:37] <akosiaris>	 !log reboot nitrogen for PCID, INVPCID feature enabling (INVPCID not supported on current hardware, but still enabling it cluster wide)
[12:56:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:58] <akosiaris>	 !log enable puppet across the fleet after nitrogen (puppetdb) reboot
[12:58:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:14] <akosiaris>	 ok I think we are done with KVM vms
[12:58:39] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s3 on db2050 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2036.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2036.codfw.wmnet (111 Connection refused)
[12:58:48] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s3 on db2036 is CRITICAL: CRITICAL slave_io_state could not connect
[12:58:48] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s3 on dbstore2001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2036.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2036.codfw.wmnet (111 Connection refused)
[12:58:48] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s3 on db2057 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2036.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2036.codfw.wmnet (111 Connection refused)
[12:59:06] <akosiaris>	 marostegui: jynus: ^ known ?
[12:59:09] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s3 on db2036 is CRITICAL: CRITICAL slave_sql_state could not connect
[12:59:09] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s3 on db2043 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2036.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2036.codfw.wmnet (111 Connection refused)
[12:59:18] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s3 on db2018 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2036.codfw.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db2036.codfw.wmnet (104 Connection reset by peer)
[12:59:19] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s3 on db2074 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2036.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2036.codfw.wmnet (111 Connection refused)
[13:00:13] <akosiaris>	 connection refused even on the unix socket ?
[13:00:31] <akosiaris>	 power mysqld is in the D state
[13:01:00] <marostegui>	 Checking
[13:01:50] <moritzm>	 akosiaris: webperf1001 and poolcounter2001 are KVM instances and still need a reboot
[13:01:50] <marostegui>	 mysql crashed
[13:01:58] <marostegui>	 it is doing recovery
[13:02:10] <akosiaris>	 marostegui: ok
[13:02:28] <akosiaris>	 moritzm: ok the first one is easy, the second wants a minor depooling, will do
[13:02:40] <akosiaris>	 thanks!
[13:03:09] <moritzm>	 we've usually just rebooted the codfw poolcounters without depooling
[13:03:15] <akosiaris>	 !log reboot webperf1001 for PCID, INVPCID feature enabling (INVPCID not supported on current hardware, but still enabling it cluster wide)
[13:03:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:28] <jynus>	 did s3 master go down?
[13:03:44] <akosiaris>	 jynus: db2036 ? yes
[13:03:47] <akosiaris>	 mysql crash
[13:03:48] <marostegui>	 yes
[13:03:54] <jynus>	 not sure we should put it up
[13:04:10] <jynus>	 I think it has no gtid, and we should promote other master instead
[13:04:20] <marostegui>	 jynus: can it be related to the compare.py running?
[13:04:25] <akosiaris>	 moritzm: true. I 'll do so now
[13:04:25] <jynus>	 it could be
[13:04:36] <jynus>	 it wasn't running now
[13:04:52] <jynus>	 but activity == corruption/crash
[13:05:16] <akosiaris>	 !log reboot poolcounter2001 for PCID/INVPCID CPU feature enabling
[13:05:19] <jynus>	 despite execution there stopped at 2018-01-18T12:59:27.299959
[13:05:23] <moritzm>	 akosiaris: I'll take care of poolcounter1002 (the only remaining baremetal poolcounter), but CI throws a weird error on https://gerrit.wikimedia.org/r/#/c/404967/
[13:05:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:28] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag could not connect
[13:05:45] <jynus>	 marostegui: let's stop all replicas there
[13:05:48] <marostegui>	 https://phabricator.wikimedia.org/P6611
[13:06:22] <jynus>	 that is my query
[13:06:22] <marostegui>	 I changed my.cnf to make it start with replication stopped as it was starting
[13:06:38] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db2074 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 616.94 seconds
[13:06:58] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 638.40 seconds
[13:07:18] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 654.88 seconds
[13:07:19] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 661.23 seconds
[13:07:19] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db2018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 662.47 seconds
[13:07:42] <jynus>	 but look at the query time
[13:08:05] <jynus>	 ah, so yes, it could be related
[13:08:38] <jynus>	 or more likely, it could be a trigger
[13:08:49] <marostegui>	 All the slaves are stopped in the same position, so that is good
[13:09:00] <marostegui>	 mysql is broken, it won't start
[13:09:03] <marostegui>	 we have to promote a new host
[13:09:05] <marostegui>	 db2043?
[13:09:21] <jynus>	 did you stop them manually or should I?
[13:09:30] <marostegui>	 stop them yeah
[13:09:32] <jynus>	 pfff, don't know
[13:09:34] <marostegui>	 I will prepare the patches
[13:09:47] <marostegui>	 we can promote the old master
[13:09:49] <jynus>	 35 didn't have partitioning
[13:10:15] <jynus>	 so no alter was done there
[13:10:29] <jynus>	 on the other side, maybe an older host has problems when many objects are queried
[13:10:49] <marostegui>	 yeah, could be
[13:10:50] <jynus>	 with 35 I meant 36
[13:10:56] <marostegui>	 let's promote the old master for now?
[13:11:06] <icinga-wm>	 PROBLEM - mysqld processes on db2036 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[13:11:07] <jynus>	 all are on 10.0.33
[13:11:23] <marostegui>	 going to silence all the codfw hosts
[13:11:37] <jynus>	 that is ok to me, although we will have to change it again
[13:11:38] <icinga-wm>	 PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 3 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Service[nagios-nrpe-server],Exec[ip addr add 2620:0:860:102:10:192:16:139/64 dev eth0],Exec[absent_ensure_members]
[13:11:50] <jynus>	 I almost prefer another host
[13:12:03] <jynus>	 so we reconstruct db2035 from db2018
[13:12:09] <jynus>	 *db2036, again
[13:12:24] <wikibugs>	 (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9778/" [puppet] - 10https://gerrit.wikimedia.org/r/404954 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey)
[13:12:27] <marostegui>	 sure
[13:12:30] <wikibugs>	 (03PS3) 10Elukey: Allow to set the JAVA_HOME env variable in hadoop/hive/oozie [puppet] - 10https://gerrit.wikimedia.org/r/404954 (https://phabricator.wikimedia.org/T166248)
[13:12:32] <jynus>	 your first proposal was ok db2043
[13:12:53] <marostegui>	 let's go for it
[13:13:00] <marostegui>	 you do the change master and I prepare the patches?
[13:13:04] <jynus>	 let's double check with db1075 position
[13:13:09] <jynus>	 I will do that, then
[13:13:14] <marostegui>	 ok, I will prepare the patches
[13:13:21] <jynus>	 you prepare puppet and mediawiki
[13:13:34] <marostegui>	 yep
[13:13:42] <marostegui>	 the slaves stopped at: 1000278638
[13:13:43] <marostegui>	 all of them
[13:13:47] <marostegui>	 but double check it
[13:14:31] <jynus>	 yes, but we need a real master coords
[13:14:48] <jynus>	 for db2043
[13:15:38] <marostegui>	 yep
[13:16:12] <wikibugs>	 (03PS2) 10Filippo Giunchedi: scap: require sudo rules to be in place before deploy [puppet] - 10https://gerrit.wikimedia.org/r/404945 (https://phabricator.wikimedia.org/T185189)
[13:16:23] <marostegui>	 let me know when I can restart db2043 for statement based change
[13:18:04] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Promote db2043 to be s3 master [puppet] - 10https://gerrit.wikimedia.org/r/404974
[13:19:01] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] scap: require sudo rules to be in place before deploy [puppet] - 10https://gerrit.wikimedia.org/r/404945 (https://phabricator.wikimedia.org/T185189) (owner: 10Filippo Giunchedi)
[13:19:24] <jynus>	 !log changing topology of codfw s3 databases
[13:19:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:24] <marostegui>	 jynus: we have to restart db2043, let me know when I can proceed
[13:20:33] <marostegui>	 after the topology change 
[13:22:28] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp3037 is CRITICAL: CRITICAL: expiry mailbox lag is 2062222
[13:22:41] <wikibugs>	 (03PS1) 10Marostegui: db-codfw.php: Promote db2043 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404976
[13:22:43] <jynus>	 I am changing to MASTER_LOG_FILE='db2043-bin.003722', MASTER_LOG_POS=379002384
[13:22:50] <wikibugs>	 (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/9779/" [puppet] - 10https://gerrit.wikimedia.org/r/404974 (owner: 10Marostegui)
[13:22:53] <jynus>	 all other replicas
[13:23:21] <jynus>	 marostegui: ok?
[13:23:36] <marostegui>	 checking that
[13:23:37] <marostegui>	 one sec
[13:23:56] <marostegui>	 379002384 is correct and so is db2043-bin.003722
[13:24:00] <marostegui>	 go ahead
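The coordinates agreed above (binlog file plus byte position) are what a file/position `CHANGE MASTER TO` statement takes. A minimal sketch of assembling that statement, assuming a hypothetical helper name `change_master_sql` and that only the two options quoted in the channel are being set; the actual commands run on the replicas are not shown in the log:

```python
def change_master_sql(log_file: str, log_pos: int) -> str:
    """Build a CHANGE MASTER TO statement for file/position replication.

    Hypothetical helper for illustration only; it sets just the two
    options quoted in the conversation.
    """
    if not log_file or "'" in log_file:
        raise ValueError("unexpected binlog file name: %r" % log_file)
    if log_pos < 0:
        raise ValueError("binlog position must be non-negative")
    return (
        "CHANGE MASTER TO "
        f"MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={log_pos};"
    )


# Coordinates quoted by jynus and confirmed by marostegui.
statement = change_master_sql("db2043-bin.003722", 379002384)
print(statement)
```

Having both people derive the coordinates independently, as done here, catches a mistyped position before replication is pointed at it.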
[13:24:03] <jynus>	 thanks
[13:24:21] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] "noop as expected https://puppet-compiler.wmflabs.org/compiler03/9780/" [puppet] - 10https://gerrit.wikimedia.org/r/404945 (https://phabricator.wikimedia.org/T185189) (owner: 10Filippo Giunchedi)
[13:24:30] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s3 on db2018 is OK: OK slave_io_state Slave_IO_Running: Yes
[13:24:58] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s3 on db2050 is OK: OK slave_io_state Slave_IO_Running: Yes
[13:24:58] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s3 on db2057 is OK: OK slave_io_state Slave_IO_Running: Yes
[13:25:19] <icinga-wm>	 PROBLEM - Check size of conntrack table on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:25:19] <icinga-wm>	 PROBLEM - Check systemd state on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:25:28] <icinga-wm>	 PROBLEM - Disk space on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:25:29] <wikibugs>	 (03PS3) 10Filippo Giunchedi: scap: require sudo rules to be in place before deploy [puppet] - 10https://gerrit.wikimedia.org/r/404945 (https://phabricator.wikimedia.org/T185189)
[13:25:38] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s3 on db2074 is OK: OK slave_io_state Slave_IO_Running: Yes
[13:25:38] <icinga-wm>	 PROBLEM - configured eth on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:25:48] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:25:58] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s3 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes
[13:26:09] <icinga-wm>	 PROBLEM - dhclient process on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:26:19] <icinga-wm>	 PROBLEM - DPKG on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:27:01] <jynus>	 the topology change of lower level is done, I have yet to promote db2043 as master
[13:27:14] <marostegui>	 https://gerrit.wikimedia.org/r/#/c/404974/1
[13:27:18] <icinga-wm>	 PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer
[13:27:21] <marostegui>	 let's restart db2043 with statement
[13:27:28] <marostegui>	 puppet is stopped there
[13:27:39] <marostegui>	 let's merge, stop, run puppet and start mysql
[13:28:14] <jynus>	 let me gather all pos first
[13:28:39] <marostegui>	 cool
[13:29:20] <icinga-wm>	 RECOVERY - dhclient process on pybal-test2001 is OK: PROCS OK: 0 processes with command name dhclient
[13:29:20] <icinga-wm>	 RECOVERY - Check size of conntrack table on pybal-test2001 is OK: OK: nf_conntrack is 0 % full
[13:29:21] <icinga-wm>	 RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[13:29:21] <icinga-wm>	 RECOVERY - Check systemd state on pybal-test2001 is OK: OK - running: The system is fully operational
[13:29:21] <icinga-wm>	 RECOVERY - DPKG on pybal-test2001 is OK: All packages OK
[13:29:30] <icinga-wm>	 RECOVERY - Disk space on pybal-test2001 is OK: DISK OK
[13:29:40] <icinga-wm>	 RECOVERY - configured eth on pybal-test2001 is OK: OK - interfaces up
[13:29:51] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on pybal-test2001 is OK: OK ferm input default policy is set
[13:31:57] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] "No, I don't think it's you. Look at the production catalog as well (http://puppet-compiler.wmflabs.org/9768/ganeti1001.eqiad.wmnet/prod.ga" [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn)
[13:34:16] <wikibugs>	 (03PS7) 10Alexandros Kosiaris: ganeti: create profiles, split monitoring/firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn)
[13:34:45] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] "I 'll merge to confirm my suspicion and I 'll be ready for a rollback" [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn)
[13:34:53] <jynus>	 sorry it is taking me some time, as it is not user facing, I am trying to be thorough
[13:35:14] <marostegui>	 You have the position from where db2043 needs to replicate on db1075?
[13:35:47] <marostegui>	 db1075-bin.002424
[13:35:47] <marostegui>	 432232308
[13:35:49] <marostegui>	 that is what I got
[13:36:59] <jynus>	 I am on it
[13:39:06] <jynus>	 last commit on binlog is a set of INSERTs INTO `srwiki`.`pagelinks`
[13:40:34] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on restbase2007 - https://phabricator.wikimedia.org/T185214#3909578 (10ops-monitoring-bot)
[13:40:47] <jynus>	 I get 432231034
[13:41:20] <wikibugs>	 (03PS3) 10Awight: Split out retrieving globals and use a more machine-readable format [dumps] - 10https://gerrit.wikimedia.org/r/348002
[13:41:29] <jynus>	 its gtid is 171966669-171966669-1746907396
[13:41:40] <icinga-wm>	 RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[13:41:48] <marostegui>	 432231034 for db1075 you mean?
[13:41:58] <jynus>	 yes
[13:42:20] <jynus>	 that is around 1K before yours
[13:42:29] <marostegui>	 yep
[13:42:29] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: ganeti: Remove KSM diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/404977
[13:42:32] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: ganeti: Remove long absented resource [puppet] - 10https://gerrit.wikimedia.org/r/404978
[13:42:38] <marostegui>	 let me double check
[13:42:51] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "A noop on ganeti1001, no firewall rule changes, the bug seems to rest with the compiler" [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn)
[13:44:50] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on restbase2007 - https://phabricator.wikimedia.org/T185214#3909585 (10fgiunchedi) 05Open>03Invalid That's me turning HBA mode on as part of {T184100} and forgetting the downtime
[13:44:56] <jynus>	 marostegui: update on dawiki.page; insert on srwiki.pagelinks (CUT) INSERT on zh_min_nanwiki.module_deps
[13:45:07] <marostegui>	 yeah, I was seeing that right ow
[13:45:08] <marostegui>	 now
[13:49:14] <marostegui>	 yeah
[13:49:19] <marostegui>	 I think it is 432231034
[13:49:34] <moritzm>	 !log upgrade mw* servers in eqiad running 3.18.5+dfsg-1+wmf3 (recent installations) to 3.18.5+dfsg-1+wmf4
[13:49:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:51] <wikibugs>	 (03PS1) 10Ema: cache_upload: upgrade cp3037 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/404979 (https://phabricator.wikimedia.org/T180433)
[13:50:26] <zeljkof>	 hm, much of those at wmlog: Cannot access property on non-object in /srv/mediawiki/php-1.31.0-wmf.17/extensions/GlobalBlocking/includes/api/ApiGlobalBlock.php
[13:51:10] <wikibugs>	 (03PS1) 10Gehel: mediawiki: remove unused logging configuration of mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/404980 (https://phabricator.wikimedia.org/T182304)
[13:51:33] <wikibugs>	 (03CR) 10Ema: [C: 032] cache_upload: upgrade cp3037 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/404979 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema)
[13:51:52] <marostegui>	 jynus: that UPDATE happened on the slaves then?
[13:52:16] <marostegui>	 I mean, is the row consistent?
[13:52:24] <jynus>	 I am checking
[13:52:43] <jynus>	 the insert on srwiki.pagelinks happened
[13:52:55] <jynus>	 I am now checking that the latest insert didn't happen
[13:53:44] <wikibugs>	 10Operations, 10Puppet, 10puppet-compiler: Puppet compiler failure to lookup some keys - https://phabricator.wikimedia.org/T185215#3909612 (10akosiaris) p:05Triage>03High Triaging as high as this might bite us and cause issues. We should investigate more and act accordingly
[13:53:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Tracked in https://phabricator.wikimedia.org/T185215" [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn)
[13:54:05] <ema>	 !log cache_upload: upgrade cp3037 to varnish 5
[13:54:09] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: ganeti: Remove KSM diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/404977
[13:54:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:54] <jynus>	 marostegui: the dawiki update did not happen
[13:55:26] <marostegui>	 that is good then
[13:55:51] <jynus>	 so I have checked it is between 432231034 and 432232308
[13:56:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] ganeti: Remove KSM diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/404977 (owner: 10Alexandros Kosiaris)
[13:57:09] <marostegui>	 yeah, if nothing after 432232533 happened, then we are safe
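The two candidate positions on db1075 can be compared as (file index, byte offset) pairs. A minimal sketch, assuming hypothetical names `BinlogCoord` and `parse_coord`, and assuming the numeric suffix of the binlog file name gives the ordering between files:

```python
from typing import NamedTuple


class BinlogCoord(NamedTuple):
    index: int  # numeric suffix of the binlog file name
    pos: int    # byte offset inside that file


def parse_coord(log_file: str, pos: int) -> BinlogCoord:
    # 'db1075-bin.002424' -> index 2424
    return BinlogCoord(int(log_file.rsplit(".", 1)[1]), pos)


# Positions quoted in the channel for db1075-bin.002424.
jynus_coord = parse_coord("db1075-bin.002424", 432231034)
marostegui_coord = parse_coord("db1075-bin.002424", 432232308)

# Tuples compare by file index first, then by offset.
gap = marostegui_coord.pos - jynus_coord.pos
print(jynus_coord < marostegui_coord, gap)
```

The gap between the two offsets is 1274 bytes, which matches the "around 1K" estimate given earlier in the conversation.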
[13:57:37] <jynus>	 but check the new master binary log
[13:57:44] <jynus>	 the last insert is not there
[13:57:47] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] ganeti: Remove KSM diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/404977 (owner: 10Alexandros Kosiaris)
[13:57:53] <marostegui>	 db2043 you mean?
[13:57:59] <jynus>	 yes
[13:59:42] <marostegui>	 yeah, the dawiki update isn't there
[13:59:52] <jynus>	 but I think it is already inserted, maybe? so the last commit has not been logged?
[14:00:04] <jouncebot>	 addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180118T1400).
[14:00:05] <jouncebot>	 rxy: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:13] <zeljkof>	 I can SWAT today
[14:00:21] <zeljkof>	 rxy: around for SWAT?
[14:00:28] <marostegui>	 jynus:  I am checking the insert on srwiki
[14:00:30] <rxy>	 zeljkof: hi, Yes
[14:00:30] <wikibugs>	 (03CR) 10Gehel: "Puppet compiler agree, this is a noop: https://puppet-compiler.wmflabs.org/compiler02/9781/" [puppet] - 10https://gerrit.wikimedia.org/r/404980 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel)
[14:00:39] <jynus>	 or, it has been deleted later
[14:00:53] <zeljkof>	 rxy: I'll let you know when the commit is at mwdebug1002, it should be there in a few minutes
[14:01:06] <wikibugs>	 (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182) (owner: 10Rxy)
[14:01:38] <marostegui>	 I don't see: 
[14:01:40] <marostegui>	 use `srwiki`/*!*/;
[14:01:40] <marostegui>	 SET TIMESTAMP=1516280175/*!*/;
[14:01:40] <marostegui>	 INSERT /* LinksUpdate::incrTableUpdate
[14:01:43] <marostegui>	 on the new master binlog
[14:02:22] <jynus>	 I do
[14:02:30] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp3037 is OK: OK: expiry mailbox lag is 0
[14:02:36] <wikibugs>	 (03Merged) 10jenkins-bot: Change autoconfirmed settings and Enable flood group at zhwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182) (owner: 10Rxy)
[14:02:52] <wikibugs>	 (03CR) 10jenkins-bot: Change autoconfirmed settings and Enable flood group at zhwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182) (owner: 10Rxy)
[14:02:57] <marostegui>	 On: db2043-bin.003722 ?
[14:03:19] <jynus>	 see etherpad
[14:03:23] <marostegui>	 checking
[14:03:28] <jynus>	 maybe we are talking about something different
[14:03:50] <jynus>	 yeah the LinksUpdate::incrTableUpdate gets written differently
[14:03:51] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 80 on cp3037 is CRITICAL: connect to address 10.20.0.172 and port 80: Connection refused
[14:04:03] <ema>	 that's me ^
[14:04:42] <jynus>	 in row format
[14:04:50] <wikibugs>	 (03CR) 10Gehel: "I'm wondering if we might have an unusual use case on WMCS which this might break... Worst case, we should just go to the class default an" [puppet] - 10https://gerrit.wikimedia.org/r/404980 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel)
[14:04:51] <icinga-wm>	 RECOVERY - Varnish HTTP upload-frontend - port 80 on cp3037 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.168 second response time
[14:05:42] <jynus>	 maybe that happened- it was a null REPLACE so it didn't get logged on the new master binlog
[14:06:01] <zeljkof>	 rxy: 404911 is at mwdebug1002, please test and let me know if I can deploy
[14:06:21] <zeljkof>	 let me know if you don't know how to test there
[14:06:28] <zeljkof>	 there is documentation about it
[14:06:54] <marostegui>	 jynus: yeah, could be
[14:06:56] <jynus>	 you mention linksupdate
[14:07:10] <marostegui>	 yes, for srwiki
[14:07:11] <jynus>	 I am mentioning pagelinks
[14:07:21] <rxy>	 zeljkof:  https://zh.wikibooks.org/wiki/Special:群组权限?uselang=en   with "X-Wikimedia-Debug: mwdebug1002.eqiad.wmnet"  ok, It's work for me
[14:07:29] <marostegui>	 yeah, I was talking about: use `srwiki`/*!*/;
[14:07:29] <marostegui>	 SET TIMESTAMP=1516280175/*!*/;
[14:07:29] <marostegui>	 INSERT /* LinksUpdate::incrTableUpdate  */ IGNORE INTO `templatelinks`
[14:07:46] <zeljkof>	 rxy: ok, deploying
[14:08:41] <jynus>	 see my pointer
[14:08:51] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 46.32, 35.35, 31.91
[14:09:03] <jynus>	 templatelinks, which position?
[14:09:32] <logmsgbot>	 !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:404911|Change autoconfirmed settings and Enable flood group at zhwikibooks (T185182)]] (duration: 01m 13s)
[14:09:33] <marostegui>	 432233002
[14:09:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:43] <stashbot>	 T185182: Request to change autoconfirmed settings, allow autoconfirmed user to suppress redirects and allow sysop to grant and remove flood flags on zh.wikibooks - https://phabricator.wikimedia.org/T185182
[14:09:53] <zeljkof>	 rxy: it's deployed, please check and thanks for deploying with #releng :)
[14:10:12] <jynus>	 yes, that is later
[14:10:21] <jynus>	 so I think we do not disagree on that
[14:10:26] <ema>	 !log cache_upload: repool cp3037 (varnish 5)
[14:10:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:55] <zeljkof>	 !log EU SWAT finished
[14:11:00] <marostegui>	 Cool, the srwiki.pagelinks I do see on db2043 binlog yeah
[14:11:04] <marostegui>	 so we are agreeing
[14:11:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:11] <jynus>	 yeah, I think there is an extra query
[14:12:21] <jynus>	 that has been executed, but it is not on the binary log
[14:12:50] <wikibugs>	 (03PS1) 10Awight: Use a global mysql user if configured [dumps] - 10https://gerrit.wikimedia.org/r/404986
[14:12:51] <jynus>	 in any case, we can check that single row after syncing them
[14:12:52] <wikibugs>	 (03PS1) 10Awight: Tolerate empty tables [dumps] - 10https://gerrit.wikimedia.org/r/404987
[14:13:06] <jynus>	 or we can reboot the server and it should appear
[14:13:06] <rxy>	 zeljkof:  thanks.  I verified configuration in  mw1329.eqiad.wmnet.  It's work.
[14:13:16] <jynus>	 let me change the server and we can reboot it
[14:13:23] <jynus>	 did you merge the patches
[14:13:45] <marostegui>	 Nope, waiting for you
[14:13:54] <marostegui>	 Let's stop mysql, merge, run puppet and start
[14:14:05] <marostegui>	 going to merge now, puppet is stopped on db2043
[14:14:36] <wikibugs>	 (03CR) 10Marostegui: [C: 032] mariadb: Promote db2043 to be s3 master [puppet] - 10https://gerrit.wikimedia.org/r/404974 (owner: 10Marostegui)
[14:14:38] <wikibugs>	 (03PS4) 10Awight: Split out retrieving globals and use a more machine-readable format [dumps] - 10https://gerrit.wikimedia.org/r/348002 (https://phabricator.wikimedia.org/T185116)
[14:14:40] <wikibugs>	 (03PS2) 10Awight: Use a global mysql user if configured [dumps] - 10https://gerrit.wikimedia.org/r/404986 (https://phabricator.wikimedia.org/T185116)
[14:14:41] <wikibugs>	 (03PS2) 10Awight: Tolerate empty tables [dumps] - 10https://gerrit.wikimedia.org/r/404987 (https://phabricator.wikimedia.org/T185116)
[14:14:44] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Promote db2043 to be s3 master [puppet] - 10https://gerrit.wikimedia.org/r/404974
[14:16:06] <marostegui>	 jynus: all merged, puppet remains stopped till you stop mysql
[14:16:24] <jynus>	 ok, then stopping it, is the alert down?
[14:16:26] <wikibugs>	 (03PS1) 10Ema: cache_upload: upgrade cp3035 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/404990 (https://phabricator.wikimedia.org/T180433)
[14:16:48] <marostegui>	 yep, I have downtimed all codfw
[14:17:08] <jynus>	 !log stopping mysql on db2043
[14:17:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:28] <Jayprakash12345>	 Biplab:Are you here
[14:17:49] <Biplab>	 Jayprakash12345: Yes
[14:17:51] <icinga-wm>	 PROBLEM - DPKG on mw1347 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[14:18:01] <icinga-wm>	 PROBLEM - DPKG on mw1346 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[14:18:04] <jynus>	 down
[14:18:14] <marostegui>	 ok, enabling puppet and running it
[14:18:31] <icinga-wm>	 PROBLEM - HHVM rendering on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:18:50] <icinga-wm>	 RECOVERY - DPKG on mw1347 is OK: All packages OK
[14:18:51] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 47.39, 37.74, 31.61
[14:19:01] <icinga-wm>	 RECOVERY - DPKG on mw1346 is OK: All packages OK
[14:19:11] <marostegui>	 we can start it again
[14:19:13] <marostegui>	 puppet ran
[14:19:21] <icinga-wm>	 RECOVERY - HHVM rendering on mw1347 is OK: HTTP OK: HTTP/1.1 200 OK - 75320 bytes in 0.330 second response time
[14:19:31] <jynus>	 !log starting mysql on db2043
[14:19:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:58] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] network: reword slice_network_constants' errors [puppet] - 10https://gerrit.wikimedia.org/r/403699 (owner: 10Faidon Liambotis)
[14:20:07] <marostegui>	 running puppet to get pt-heartbeat up
[14:20:07] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: network: reword slice_network_constants' errors [puppet] - 10https://gerrit.wikimedia.org/r/403699 (owner: 10Faidon Liambotis)
[14:20:12] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] network: reword slice_network_constants' errors [puppet] - 10https://gerrit.wikimedia.org/r/403699 (owner: 10Faidon Liambotis)
[14:20:29] <moritzm>	 hhvm dpkg alerts is me
[14:20:39] <jynus>	 if you are ok, I will start replication again
[14:20:41] <icinga-wm>	 PROBLEM - puppet last run on mw1343 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm]
[14:20:44] <Amir1>	 Lydia_WMDE: https://phabricator.wikimedia.org/T175491
[14:20:47] <marostegui>	 yeah, go ahead
[14:20:59] <godog>	 _joe_: there's some mw servers with high load, what's needed before restarting hhvm?
[14:21:07] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: base: wrap lines in check_puppetrun to < 110 [puppet] - 10https://gerrit.wikimedia.org/r/403698 (owner: 10Faidon Liambotis)
[14:21:12] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] base: wrap lines in check_puppetrun to < 110 [puppet] - 10https://gerrit.wikimedia.org/r/403698 (owner: 10Faidon Liambotis)
[14:21:12] <marostegui>	 jynus: going to deploy mediawiki-config
[14:21:17] <_joe_>	 godog: uh, which ones?
[14:21:31] <_joe_>	 godog: I'd take a look at the output of hhvm-dump-debug
[14:21:50] <Amir1>	 sorry, wrong channel.
[14:21:57] <godog>	 _joe_: 2 critical and 4 warning in icinga atm
[14:22:12] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-codfw.php: Promote db2043 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404976 (owner: 10Marostegui)
[14:22:13] <Jayprakash12345>	 Please visit https://wikitech.wikimedia.org/wiki/Deployments#Week_of_January_15th
[14:22:14] <wikibugs>	 (03PS2) 10Ema: cache_upload: upgrade cp3035 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/404990 (https://phabricator.wikimedia.org/T180433)
[14:22:18] <wikibugs>	 (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade cp3035 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/404990 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema)
[14:22:46] <godog>	 _joe_: ok so save hhvm-dump-debug and systemctl restart hhvm
[14:23:10] <_joe_>	 godog: yes
[14:23:41] <wikibugs>	 (03Merged) 10jenkins-bot: db-codfw.php: Promote db2043 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404976 (owner: 10Marostegui)
[14:23:44] <elukey>	 there is also restart-hhvm to properly depool no?
[14:23:44] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444) (owner: 10Herron)
[14:23:46] <Jayprakash12345>	 who can swat today?
[14:23:51] <ema>	 !log cache_upload: upgrade cp3035 to varnish 5
[14:24:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:14] <elukey>	 godog: let me know if you need help, I was about to check
[14:24:31] <icinga-wm>	 PROBLEM - puppet last run on mw1342 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm]
[14:24:51] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404991
[14:24:54] <wikibugs>	 (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404991
[14:25:06] <godog>	 elukey: yup will do 
[14:25:15] <godog>	 !log restart hhvm on mw1227
[14:25:18] <jynus>	 strange, still not logged on the binlog, but the row is present
[14:25:20] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Promote db2043 to s3 master after db2036 crash (duration: 01m 12s)
[14:25:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:33] <marostegui>	 jynus: mediawiki-config deployed
[14:25:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:51] <_joe_>	 godog: I'll take a look at the rest of the cluster
[14:26:14] <marostegui>	 jynus: sync binlog and trx commit maybe was changed on the live config?
[14:26:22] <_joe_>	 godog: uhm wait, this is pretty strange
[14:26:34] <marostegui>	 As a left over from something and was not 1 and 1?
[14:26:48] <_joe_>	 uhm actually no, they do need restarts I'd say
[14:26:58] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404991 (owner: 10Marostegui)
[14:27:03] <jynus>	 so I start replication, ok?
[14:27:06] <marostegui>	 yep
[14:27:08] <wikibugs>	 (03CR) 10jenkins-bot: db-codfw.php: Promote db2043 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404976 (owner: 10Marostegui)
[14:27:12] <godog>	 _joe_: so mw1226 alarmed but looks like it recovered by itself
[14:27:19] <jynus>	 let's see if it complains
[14:27:20] <_joe_>	 godog: let me take a look
[14:27:24] <godog>	 _joe_: though mw1227 didn't and I restarted hhvm there
[14:27:39] <_joe_>	 godog: there is a very high load on the api cluster
[14:27:50] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s3 on db2043 is OK: OK slave_io_state Slave_IO_Running: Yes
[14:27:55] <jynus>	 it doesn't
[14:28:14] <jynus>	 this is some voodoo magic due to mic of row and statement?
[14:28:19] <jynus>	 *mix
[14:28:22] <Jayprakash12345>	 SWAT: All is Right?
[14:28:33] <ema>	 !log cache_upload: repool cp3035 (varnish 5)
[14:28:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:43] <marostegui>	 jynus: but it should still appear on the binlog
[14:28:58] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404991 (owner: 10Marostegui)
[14:28:59] <jynus>	 maybe it is some strange cache- I will search it
[14:29:10] <jynus>	 maybe something related to parallel replication && caches
[14:29:28] <marostegui>	 jynus: or as I said, maybe sync_binlog/trx_commit were changed manually for some reason
[14:29:34] <marostegui>	 don't know, it is weird
[14:29:48] <jynus>	 the missing insert appeared
[14:30:05] <marostegui>	 ?????
[14:30:22] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1087 - T174569 (duration: 01m 12s)
[14:30:32] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404991 (owner: 10Marostegui)
[14:30:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:35] <stashbot>	 T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[14:30:38] <jynus>	 REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_master_log_pos, shard, datacenter) VALUES ('2018-01-18T14:27:26.000930'
[14:30:40] <jynus>	 and later
[14:30:51] <wikibugs>	 (03PS1) 10Marostegui: db2036: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/404992
[14:30:56] <_joe_>	 HPHP::jit::enterTCImpl seems to block quite a few threads
[14:30:58] <jynus>	 #180118 12:56:15 server id 171966669
[14:30:58] <Jayprakash12345>	 Hello SWAT?
[14:31:11] <marostegui>	 jynus: :|
[14:31:11] <jynus>	 INSERT /* Wikimedia\Rdbms\DatabaseMysqlBase::upsert  */ INTO `module_deps` (md_module,md_skin,md_deps) VALUES (
[14:31:16] <_joe_>	 !log restarting hhvm on a few API appservers
[14:31:18] <marostegui>	 cache then?
[14:31:21] <jynus>	 so parallel weirdness
[14:31:24] <_joe_>	 godog: there are a few more to restart
[14:31:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:34] <jynus>	 making binlog inconsistent with state
[14:31:35] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db2036: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/404992 (owner: 10Marostegui)
[14:31:41] <jynus>	 even after reboot!
[14:31:45] <Jayprakash12345>	 Zppix: ?
[14:31:50] <jynus>	 because this is on the new binlog after reboot
[14:32:15] <wikibugs>	 10Puppet, 10Analytics-Kanban, 10User-Elukey: analytics VPS project puppet errors - https://phabricator.wikimedia.org/T184482#3909685 (10elukey) Fixed all except j1.analytics.eqiad.wmflabs - @Ottomata do we still need this? It seems running superset, and puppet is broken in there..
[14:32:19] <marostegui>	 This is so weird...
[14:32:22] <marostegui>	 And scary
[14:32:24] <jynus>	 I do not like this- I would be ok with the cache 
[14:32:25] <Zppix>	 Jayprakash12345: yes?
[14:32:37] <jynus>	 but not in cold
[14:32:40] <Jayprakash12345>	 Zppix: https://wikitech.wikimedia.org/wiki/Deployments#Wednesday,_January_17
[14:33:27] <Zppix>	 Jayprakash12345: what about it?
[14:33:31] <jynus>	 on the other side, they have different domain ids
[14:33:44] <jynus>	 so we do not guarantee consistency
[14:34:00] <jynus>	 because different serverid == different domain
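A MariaDB GTID has the form domain-server-sequence, so the GTID quoted earlier (171966669-171966669-1746907396) can be split into its parts to see which replication domain an event belongs to. A minimal sketch, assuming hypothetical names `Gtid`, `parse_gtid` and `same_domain`; the second GTID below is invented purely for the comparison:

```python
from typing import NamedTuple


class Gtid(NamedTuple):
    domain_id: int
    server_id: int
    seq_no: int


def parse_gtid(text: str) -> Gtid:
    """Split a MariaDB GTID string 'domain-server-sequence' into integers."""
    parts = text.strip().split("-")
    if len(parts) != 3:
        raise ValueError("expected domain-server-sequence, got %r" % text)
    return Gtid(*(int(p) for p in parts))


def same_domain(a: Gtid, b: Gtid) -> bool:
    # Ordering between events is only meaningful inside one domain.
    return a.domain_id == b.domain_id


quoted = parse_gtid("171966669-171966669-1746907396")
# Hypothetical GTID from another domain, for illustration only.
other = parse_gtid("1-1-42")
print(quoted.domain_id, same_domain(quoted, other))
```

In the quoted GTID the domain id equals the server id, which fits the remark that a different server id here means a different domain.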
[14:34:23] <godog>	 _joe_: ack, all of those in warning in icinga?
[14:34:26] <Jayprakash12345>	 Zppix:Are you SWAT member???
[14:34:29] <Zppix>	 Jayprakash12345: you realize i have no prod access right... and today is the 18th
[14:34:30] <_joe_>	 godog: at least
[14:34:46] <_joe_>	 godog: I'm doing all servers up to mw1230 btw
[14:35:02] <Zppix>	 Jayprakash12345: aka im not a deployer
[14:35:12] <Jayprakash12345>	 Zppix: Sorry
[14:35:23] <Zppix>	 Jayprakash12345: no worries :P
[14:35:25] <_joe_>	 godog: use "restart-hhvm" btw
[14:35:38] <Zppix>	 Jayprakash12345: regardless the next swat window is not for a while
[14:35:40] <_joe_>	 it depools the host, restarts hhvm, then it pools it back
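The depool, restart, repool order described for restart-hhvm can be sketched with injected callables. This is not the real script; `restart_with_depool` and its three parameters are hypothetical names, and each step is assumed to raise an exception on failure:

```python
from typing import Callable, List


def restart_with_depool(
    depool: Callable[[], None],
    restart: Callable[[], None],
    pool: Callable[[], None],
) -> None:
    """Run the three steps in order; stop at the first one that raises."""
    depool()   # take the host out of rotation first
    restart()  # restart the service while it receives no traffic
    pool()     # put the host back only if the restart succeeded


# Usage with stand-in steps that just record the order they ran in.
calls: List[str] = []
restart_with_depool(
    lambda: calls.append("depool"),
    lambda: calls.append("restart"),
    lambda: calls.append("pool"),
)
print(calls)
```

In this sketch a failed restart leaves the host depooled, on the assumption that a broken server should not be returned to service automatically.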
[14:36:13] <Jayprakash12345>	 zeljkof: Can you merge https://gerrit.wikimedia.org/r/#/c/404067/
[14:36:42] <wikibugs>	 (03CR) 10Steinsplitter: [C: 031] "looks ok" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404148 (https://phabricator.wikimedia.org/T184865) (owner: 10Biplab Anand)
[14:36:59] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 9.15, 15.20, 23.20
[14:37:00] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 8.95, 13.50, 23.78
[14:37:19] <andre__>	 Jayprakash12345: Is there a reason why you specifically ask zeljkof ? 
[14:37:43] <wikibugs>	 (03PS1) 10Ema: cache_upload: upgrade cp3038 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/404993 (https://phabricator.wikimedia.org/T180433)
[14:37:46] <Jayprakash12345>	 Because He is SWAT member
[14:38:02] <godog>	 _joe_: sounds good, I'll do on mw1233
[14:39:12] <godog>	 !log restart hhvm on mw1233
[14:39:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:29] <icinga-wm>	 RECOVERY - puppet last run on mw1342 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:39:39] <wikibugs>	 (03CR) 10Ema: [C: 032] cache_upload: upgrade cp3038 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/404993 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema)
[14:39:54] <ema>	 !log cache_upload: upgrade cp3038 to varnish 5
[14:40:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:16] <_joe_>	 the cpu utilization curve of hhvm after a restart will never ever cease to amaze me. for the first couple minutes, it's worse than php5
[14:40:44] <jynus>	 _joe_: JIT?
[14:40:58] <zeljkof>	 andre__: it's SWAT time :)
[14:41:01] <_joe_>	 jynus: that and re-loading of all files into the repo
[14:41:29] <_joe_>	 they need to validate the files when starting, but yeah it's mostly jit and caching
[14:42:05] <jynus>	 maybe we could "pre-compile" some or most of the code by executing reads on repool?
[14:42:30] <_joe_>	 we did do something like that
[14:42:33] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: Move role::grafana::base to profile::grafana [puppet] - 10https://gerrit.wikimedia.org/r/404308 (https://phabricator.wikimedia.org/T170150)
[14:42:35] <wikibugs>	 (03PS6) 10Alexandros Kosiaris: Remove role::grafana::labs [puppet] - 10https://gerrit.wikimedia.org/r/404314 (https://phabricator.wikimedia.org/T170150)
[14:42:37] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Simplify profile::grafana::production [puppet] - 10https://gerrit.wikimedia.org/r/404319 (https://phabricator.wikimedia.org/T170150)
[14:42:40] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: grafana: Allow to modify the config in hiera [puppet] - 10https://gerrit.wikimedia.org/r/404320 (https://phabricator.wikimedia.org/T170150)
[14:42:41] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: grafana: Add migration script from proxy to LDAP auth [puppet] - 10https://gerrit.wikimedia.org/r/404651 (https://phabricator.wikimedia.org/T170150)
[14:42:43] <_joe_>	 but the effects weren't good enough to justify investing in it
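Editor's note: the warm-up idea jynus raises above (issuing read requests before repooling, so the JIT and file cache are populated before real traffic arrives) could look roughly like the sketch below. This is only an illustration under stated assumptions, not the approach _joe_ says was tried earlier: the names `warm_up`, `ready_to_repool` and `http_fetch`, the URL list, and the number of rounds are all hypothetical.

```python
import urllib.request
from typing import Callable, Dict, Iterable


def http_fetch(url: str, timeout: float = 10.0) -> int:
    """Fetch a URL and return its HTTP status code (raises OSError subclasses on failure)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def warm_up(fetch: Callable[[str], int], urls: Iterable[str], rounds: int = 3) -> Dict[str, int]:
    """Request every URL `rounds` times and count the successful (2xx/3xx) responses per URL.

    `fetch` is injected so the loop can be exercised without a network.
    """
    successes: Dict[str, int] = {}
    for url in urls:
        successes[url] = 0
        for _ in range(rounds):
            try:
                status = fetch(url)
            except OSError:
                # Connection errors and HTTP errors from urllib count as failed attempts.
                continue
            if 200 <= status < 400:
                successes[url] += 1
    return successes


def ready_to_repool(results: Dict[str, int], rounds: int) -> bool:
    """Hypothetical policy: repool only if every warm-up request succeeded."""
    return all(count == rounds for count in results.values())
```

In use, an operator would call `warm_up(http_fetch, urls)` against the depooled host and repool only when `ready_to_repool` returns true. As noted in the channel, the gain from a similar effort was judged too small to justify the investment.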
[14:42:44] <wikibugs>	 (03PS7) 10Alexandros Kosiaris: grafana: Enable grafana LDAP in production [puppet] - 10https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150)
[14:43:21] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] "Noop per https://puppet-compiler.wmflabs.org/compiler02/9782/krypton.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/404320 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris)
[14:43:39] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] "Noop per https://puppet-compiler.wmflabs.org/compiler02/9782/" [puppet] - 10https://gerrit.wikimedia.org/r/404319 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris)
[14:43:46] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] "Noop per https://puppet-compiler.wmflabs.org/compiler02/9782/krypton.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/404314 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris)
[14:43:56] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] "Noop per https://puppet-compiler.wmflabs.org/compiler02/9782/krypton.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/404308 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris)
[14:44:18] <herron>	 !log disabling puppet agents during deploy of 404587, 404689
[14:44:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:10] <ema>	 !log cache_upload: repool cp3038 (varnish 5)
[14:45:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:45] <wikibugs>	 (03CR) 10Herron: [C: 032] puppetmaster::ssl: fix crl file suffix [puppet] - 10https://gerrit.wikimedia.org/r/404587 (https://phabricator.wikimedia.org/T184444) (owner: 10Herron)
[14:45:50] <wikibugs>	 (03PS2) 10Herron: puppetmaster::ssl: fix crl file suffix [puppet] - 10https://gerrit.wikimedia.org/r/404587 (https://phabricator.wikimedia.org/T184444)
[14:46:26] <wikibugs>	 (03PS3) 10Herron: add support for SSLCARevocationCheck setting in puppetmaster frontend [puppet] - 10https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444)
[14:48:42] <wikibugs>	 (03CR) 10Herron: [C: 032] add support for SSLCARevocationCheck setting in puppetmaster frontend [puppet] - 10https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444) (owner: 10Herron)
[14:49:06] <wikibugs>	 (03PS4) 10Herron: add support for SSLCARevocationCheck setting in puppetmaster frontend [puppet] - 10https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444)
[14:51:07] <wikibugs>	 (03CR) 10Rush: [C: 031] "we all talked it over, seems like a reasonable idea" [puppet] - 10https://gerrit.wikimedia.org/r/403621 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi)
[14:53:41] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404995 (https://phabricator.wikimedia.org/T162807)
[14:54:19] <icinga-wm>	 PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:54:42] <herron>	 ^ that’s me
[14:56:59] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404995 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui)
[14:58:37] <volans>	 !log reprepro includedeb jessie-wikimedia python-requests-mock_1.3.0-3_all.deb
[14:58:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:51] <moritzm>	 !log installing bind security updates (we only use the client-side tools)
[14:59:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:06] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404995 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui)
[14:59:16] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404995 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui)
[14:59:49] <wikibugs>	 (03CR) 10Volans: [C: 032] PuppetDB backend: add support for API v4 [software/cumin] - 10https://gerrit.wikimedia.org/r/399821 (https://phabricator.wikimedia.org/T182575) (owner: 10Volans)
[15:00:58] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1066  - T162807 (duration: 01m 11s)
[15:01:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:10] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[15:01:43] <marostegui>	 !log Stop replication in sync db1089 and db1066 - T162807
[15:01:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:20] <wikibugs>	 (03Merged) 10jenkins-bot: PuppetDB backend: add support for API v4 [software/cumin] - 10https://gerrit.wikimedia.org/r/399821 (https://phabricator.wikimedia.org/T182575) (owner: 10Volans)
[15:05:35] <wikibugs>	 (03CR) 10jenkins-bot: PuppetDB backend: add support for API v4 [software/cumin] - 10https://gerrit.wikimedia.org/r/399821 (https://phabricator.wikimedia.org/T182575) (owner: 10Volans)
[15:07:00] <icinga-wm>	 PROBLEM - DPKG on restbase2005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:08:09] <icinga-wm>	 RECOVERY - Keyholder SSH agent on labpuppetmaster1002 is OK: OK: Keyholder is armed with all configured keys.
[15:10:09] <icinga-wm>	 RECOVERY - DPKG on restbase2005 is OK: All packages OK
[15:10:10] <icinga-wm>	 PROBLEM - puppet last run on wtp1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:10:40] <icinga-wm>	 PROBLEM - puppet last run on mw1343 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm]
[15:11:12] <logmsgbot>	 !log mforns@tin Started deploy [analytics/refinery@78f98d9]: deploying refinery to add ISO codes to pageviews by country
[15:11:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:00] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 54.93 seconds
[15:12:10] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 31.23 seconds
[15:12:10] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db2074 is OK: OK slave_sql_lag Replication lag: 29.62 seconds
[15:12:10] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db2018 is OK: OK slave_sql_lag Replication lag: 29.63 seconds
[15:12:40] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[15:12:53] <jynus>	 I will reimage db2036 with db2018 data
[15:14:28] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 031] "I think that if something on a WMCS VM is logging to production logstash then that's already wrong.  So this seems fine with me unless you" [puppet] - 10https://gerrit.wikimedia.org/r/404980 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel)
[15:14:58] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404997
[15:15:03] <wikibugs>	 (03PS1) 10Volans: Copyright notice: add 2018 [software/cumin] - 10https://gerrit.wikimedia.org/r/404998
[15:15:25] <logmsgbot>	 !log mforns@tin Finished deploy [analytics/refinery@78f98d9]: deploying refinery to add ISO codes to pageviews by country (duration: 04m 12s)
[15:15:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:22] <marostegui>	 jynus: ok!
[15:16:40] <icinga-wm>	 PROBLEM - etc request latencies on neon is CRITICAL: CRITICAL - etcd_request_latencies is 58251 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[15:17:10] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404997 (owner: 10Marostegui)
[15:17:12] <volans>	 akosiaris: FYI typo, missing 'd' in etcd :-P ^^^
[15:18:03] <wikibugs>	 (03CR) 10Volans: [C: 032] Copyright notice: add 2018 [software/cumin] - 10https://gerrit.wikimedia.org/r/404998 (owner: 10Volans)
[15:18:40] <icinga-wm>	 RECOVERY - etc request latencies on neon is OK: OK - etcd_request_latencies is 3791 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[15:18:42] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: grafana: Break dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/405000
[15:18:46] <akosiaris>	 volans: lol
[15:18:49] <wikibugs>	 10Operations: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845#3909792 (10fgiunchedi) I tried manually setting `rootdelay` from within grub on the first boot of restbase2007 and 2008 with 5 and 3 respectively, both worked as in the raid arrays were assembled
[15:19:14] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404997 (owner: 10Marostegui)
[15:19:25] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404997 (owner: 10Marostegui)
[15:20:26] <wikibugs>	 (03Merged) 10jenkins-bot: Copyright notice: add 2018 [software/cumin] - 10https://gerrit.wikimedia.org/r/404998 (owner: 10Volans)
[15:20:40] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Fix etcd request latencies alert typo [puppet] - 10https://gerrit.wikimedia.org/r/405001
[15:20:41] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1066  - T162807 (duration: 01m 12s)
[15:20:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:53] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[15:21:10] <icinga-wm>	 PROBLEM - puppet last run on mw1304 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:21:26] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] grafana: Break dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/405000 (owner: 10Alexandros Kosiaris)
[15:21:30] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] Fix etcd request latencies alert typo [puppet] - 10https://gerrit.wikimedia.org/r/405001 (owner: 10Alexandros Kosiaris)
[15:21:38] <wikibugs>	 (03CR) 10jenkins-bot: Copyright notice: add 2018 [software/cumin] - 10https://gerrit.wikimedia.org/r/404998 (owner: 10Volans)
[15:22:25] <wikibugs>	 10Operations, 10ops-eqiad: check americium eth1 cabling and link - https://phabricator.wikimedia.org/T185219#3909801 (10Jgreen)
[15:22:40] <icinga-wm>	 PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:23:10] <icinga-wm>	 PROBLEM - puppet last run on mc1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[dnsutils]
[15:30:43] <wikibugs>	 (03PS1) 10Ema: cache_upload: upgrade esams to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/405007 (https://phabricator.wikimedia.org/T180433)
[15:33:29] <wikibugs>	 (03CR) 10Ema: [C: 032] cache_upload: upgrade esams to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/405007 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema)
[15:33:51] <moritzm>	 !log reboot labsdb1006 (OSM slave) for kernel security update
[15:34:02] <ema>	 !log cache_upload: upgrade cp3047 to varnish 5
[15:34:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:17] <icinga-wm>	 RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[15:35:18] <icinga-wm>	 RECOVERY - puppet last run on wtp1028 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[15:40:07] <ema>	 !log cache_upload: repool cp3047 (varnish 5)
[15:40:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:58] <moritzm>	 !log rebooting labsdb1004 for kernel security update
[15:41:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:16] <wikibugs>	 10Puppet, 10Analytics-Kanban, 10User-Elukey: analytics VPS project puppet errors - https://phabricator.wikimedia.org/T184482#3909876 (10Ottomata) j1 deleted!
[15:42:38] <wikibugs>	 10Puppet, 10Analytics-Kanban, 10User-Elukey: analytics VPS project puppet errors - https://phabricator.wikimedia.org/T184482#3909884 (10elukey) 05Open>03Resolved Closing the task since puppet should be ok now, please re-open otherwise!
[15:43:03] <jynus>	 moritzm: did you stop mariadb in advance?
[15:43:50] <moritzm>	 jynus: that's postgres, I've synched the reboot with Aaron who's keeping an eye on wikilabels
[15:44:00] <jynus>	 no, that is postgres and mariadb
[15:44:50] <jynus>	 can you take care and/or fix replication inconsistencies when it boots?
[15:45:15] <jynus>	 *take care of checking
[15:45:28] <moritzm>	 jynus: no, I wasn't aware of that, sorry. I've always just been following https://wikitech.wikimedia.org/wiki/Service_restarts#labsdb1004_(wikilabels) for this for all recent reboots
[15:45:52] <jynus>	 that is the postgres section
[15:46:04] <jynus>	 which obviously doesn't mention mariadb
[15:46:15] <icinga-wm>	 RECOVERY - puppet last run on mw1304 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:46:51] <jynus>	 please take care of that or ping cloud so they can do it
[15:47:42] <moritzm>	 jynus: sorry, I was unaware of the mariadb part, will amend the docs. will ping the WMCS channel
[15:47:58] <jynus>	 the docs are ok
[15:48:07] <jynus>	 they tell you how to shut down postgres
[15:48:16] <jynus>	 what they don't tell is what servers are where
[15:48:23] <icinga-wm>	 RECOVERY - puppet last run on mc1030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[15:48:24] <jynus>	 *services are on which servers
[15:48:24] <icinga-wm>	 PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:49:13] <wikibugs>	 (03CR) 10Ppchelko: "Why bother people with SWAT? It's just a cleanup, it can wait until we have something substantial" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404888 (owner: 10Ppchelko)
[15:50:16] <moritzm>	 yes, as I have mentioned I was unaware of mariadb being on that host, I'll add a reminder to the labsdb1004 docs so that this doesn't happen again
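Editor's note: the gap jynus describes (the docs say how to stop a service but not which services live on which servers) can be illustrated with a small pre-reboot check. This is a minimal sketch under stated assumptions: the `HOST_SERVICES` table and the `unhandled_services` helper are hypothetical, and the only fact taken from the log is that labsdb1004 runs both postgres and mariadb.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical inventory; a real one would be generated from configuration management.
HOST_SERVICES: Dict[str, Set[str]] = {
    "labsdb1004": {"postgres", "mariadb"},
}


def unhandled_services(host: str, handled: Iterable[str],
                       inventory: Dict[str, Set[str]]) -> List[str]:
    """Return the services on `host` that the operator has not yet stopped or accounted for."""
    return sorted(inventory.get(host, set()) - set(handled))


if __name__ == "__main__":
    # Only postgres was handled before the reboot, so mariadb is reported as missed.
    print(unhandled_services("labsdb1004", ["postgres"], HOST_SERVICES))
```

A check of this kind only helps if the inventory is kept current, which is the part the channel discussion identifies as missing.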
[15:50:43] <icinga-wm>	 RECOVERY - puppet last run on mw1343 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[15:51:20] <wikibugs>	 (03PS1) 10Filippo Giunchedi: install_server: reprovision restbase200[789] [puppet] - 10https://gerrit.wikimedia.org/r/405008 (https://phabricator.wikimedia.org/T184100)
[15:52:43] <icinga-wm>	 RECOVERY - puppet last run on es2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[15:53:23] <icinga-wm>	 RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[15:53:48] <wikibugs>	 (03PS2) 10Andrew Bogott: puppet.conf: replace configtimeout [puppet] - 10https://gerrit.wikimedia.org/r/404480 (https://phabricator.wikimedia.org/T182585) (owner: 10Herron)
[15:55:19] <wikibugs>	 10Operations, 10fundraising-tech-ops, 10netops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3909954 (10Jgreen)
[15:55:29] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] puppet.conf: replace configtimeout [puppet] - 10https://gerrit.wikimedia.org/r/404480 (https://phabricator.wikimedia.org/T182585) (owner: 10Herron)
[16:01:40] <wikibugs>	 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3909975 (10Anomie) >>! In T133410#3907781, @Iniquity wrote: > @Isarra we are waiting for T180817 this task.  Regarding that, see T180817#...
[16:02:10] <wikibugs>	 10Operations, 10Puppet: Update puppetmaster1001 puppet certificate - https://phabricator.wikimedia.org/T180167#3909979 (10herron) 05Open>03Resolved a:03herron done!
[16:02:14] <wikibugs>	 10Operations, 10Puppet, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3909982 (10herron)
[16:04:02] <wikibugs>	 (03CR) 10ArielGlenn: "This function is only called if the config file doesn't have entries in the [database] section for 'user' and 'password'. It's specificall" [dumps] - 10https://gerrit.wikimedia.org/r/404986 (https://phabricator.wikimedia.org/T185116) (owner: 10Awight)
[16:04:20] <wikibugs>	 (03CR) 10ArielGlenn: [C: 031] "LGTM could merge whenever you want." [dumps] - 10https://gerrit.wikimedia.org/r/404987 (https://phabricator.wikimedia.org/T185116) (owner: 10Awight)
[16:09:00] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3910002 (10Ottomata)
[16:09:02] <wikibugs>	 10Puppet, 10Analytics, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-kafka03 due to full disk - https://phabricator.wikimedia.org/T184235#3909999 (10Ottomata) 05Open>03Resolved deployment-kafka03 has been deleted.
[16:13:00] <icinga-wm>	 PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 4 minutes ago with 5 failures. Failed resources (up to 3 shown): Service[nagios-nrpe-server],Exec[ip addr add 2620:0:860:102:10:192:16:139/64 dev eth0],Exec[absent_ensure_members],Exec[ops_ensure_members]
[16:14:43] <wikibugs>	 (03CR) 10Awight: "Thanks for noticing!  I must have changed the config file at the same time as the code..." [dumps] - 10https://gerrit.wikimedia.org/r/404986 (https://phabricator.wikimedia.org/T185116) (owner: 10Awight)
[16:14:48] <wikibugs>	 (03Abandoned) 10Awight: Use a global mysql user if configured [dumps] - 10https://gerrit.wikimedia.org/r/404986 (https://phabricator.wikimedia.org/T185116) (owner: 10Awight)
[16:15:38] <wikibugs>	 (03PS3) 10Awight: Tolerate empty tables [dumps] - 10https://gerrit.wikimedia.org/r/404987 (https://phabricator.wikimedia.org/T185116)
[16:18:42] <wikibugs>	 (03PS2) 10Filippo Giunchedi: install_server: reprovision restbase200[789] [puppet] - 10https://gerrit.wikimedia.org/r/405008 (https://phabricator.wikimedia.org/T184100)
[16:20:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] install_server: reprovision restbase200[789] [puppet] - 10https://gerrit.wikimedia.org/r/405008 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi)
[16:21:39] <wikibugs>	 (03CR) 10BryanDavis: [C: 031] "I agree with Andrew. If this is being picked up in a Cloud VPS host then the data is routing to the wrong place already and should be fixe" [puppet] - 10https://gerrit.wikimedia.org/r/404980 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel)
[16:28:19] <wikibugs>	 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3910062 (10jcrespo)
[16:29:58] <wikibugs>	 (03PS1) 10Cmjohnson: Removing mgmt dns entries for decom host erbium [dns] - 10https://gerrit.wikimedia.org/r/405012 (https://phabricator.wikimedia.org/T185226)
[16:33:11] <icinga-wm>	 PROBLEM - Check size of conntrack table on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[16:33:11] <icinga-wm>	 PROBLEM - Check systemd state on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[16:33:21] <icinga-wm>	 PROBLEM - Disk space on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[16:33:50] <icinga-wm>	 PROBLEM - configured eth on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[16:33:51] <icinga-wm>	 PROBLEM - dhclient process on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[16:34:00] <icinga-wm>	 PROBLEM - DPKG on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[16:34:01] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[16:35:50] <icinga-wm>	 RECOVERY - configured eth on pybal-test2001 is OK: OK - interfaces up
[16:35:51] <icinga-wm>	 RECOVERY - dhclient process on pybal-test2001 is OK: PROCS OK: 0 processes with command name dhclient
[16:36:00] <icinga-wm>	 RECOVERY - DPKG on pybal-test2001 is OK: All packages OK
[16:36:01] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on pybal-test2001 is OK: OK ferm input default policy is set
[16:36:11] <icinga-wm>	 RECOVERY - Check size of conntrack table on pybal-test2001 is OK: OK: nf_conntrack is 0 % full
[16:36:20] <icinga-wm>	 RECOVERY - Check systemd state on pybal-test2001 is OK: OK - running: The system is fully operational
[16:36:21] <icinga-wm>	 RECOVERY - Disk space on pybal-test2001 is OK: DISK OK
[16:41:37] <wikibugs>	 (03PS1) 10Ottomata: Use jumbo Kafka for EventStreams in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/405014 (https://phabricator.wikimedia.org/T185225)
[16:43:01] <icinga-wm>	 RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[16:53:17] <wikibugs>	 (03PS2) 10Ottomata: Use jumbo Kafka for EventStreams in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/405014 (https://phabricator.wikimedia.org/T185225)
[16:54:18] <wikibugs>	 (03CR) 10Ottomata: [C: 032] Use jumbo Kafka for EventStreams in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/405014 (https://phabricator.wikimedia.org/T185225) (owner: 10Ottomata)
[16:54:21] <wikibugs>	 (03CR) 10Ottomata: [C: 032] "No op https://puppet-compiler.wmflabs.org/compiler02/9785/" [puppet] - 10https://gerrit.wikimedia.org/r/405014 (https://phabricator.wikimedia.org/T185225) (owner: 10Ottomata)
[16:55:53] <ema>	 !log cache_upload: upgrade cp3036 to varnish 5
[16:56:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:39] <ottomata>	 hey, anyone here messing with deployment-puppetmaster02?  there are local changes to the puppet repo there
[16:59:41] <icinga-wm>	 RECOVERY - puppet last run on ms-be2023 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:00:04] <jouncebot>	 godog, moritzm, and _joe_: That opportune time is upon us again. Time for a Puppet SWAT(Max 8 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180118T1700).
[17:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[17:01:02] <ottomata>	 i'm stashing them
[17:02:12] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 031] hieradata: extend SMART eqiad deployment [puppet] - 10https://gerrit.wikimedia.org/r/403621 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi)
[17:02:19] <ema>	 !log cache_upload: repool cp3036 (varnish 5)
[17:02:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:22] <wikibugs>	 (03PS1) 10Dduvall: Include scaffold for service-checker helm tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/405016
[17:04:01] <wikibugs>	 (03PS2) 10Dduvall: Include scaffold for service-checker helm tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/405016
[17:05:33] <cmjohnson1>	 If you see mgmt icinga errors for the next 20 mins or so that is me...all servers will be in c8
[17:08:10] <icinga-wm>	 RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[17:11:07] <wikibugs>	 (03PS3) 10Dduvall: Include scaffold for service-checker helm tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/405016
[17:11:40] <wikibugs>	 10Operations, 10ops-ulsfo, 10Traffic, 10netops: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3910184 (10RobH) p:05Triage>03High
[17:12:04] <wikibugs>	 10Operations, 10ops-ulsfo, 10Traffic, 10netops: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3910199 (10RobH) Since this involves #traffic as well as #netops, this plan should get @bblack's review/approval.
[17:12:32] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#3910200 (10ayounsi)
[17:14:13] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#3908273 (10ayounsi) Port mapping is correct. Note that the "fe-" ports will be renamed "ge-", but their physical location is unchanged.
[17:14:17] <wikibugs>	 10Operations, 10ops-ulsfo, 10Traffic, 10netops: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3910209 (10RobH)
[17:18:44] <wikibugs>	 (03PS1) 10Filippo Giunchedi: install_server: swap restbase2009 disk layout [puppet] - 10https://gerrit.wikimedia.org/r/405019 (https://phabricator.wikimedia.org/T184100)
[17:20:01] <wikibugs>	 (03PS2) 10Filippo Giunchedi: install_server: swap restbase2009 disk layout [puppet] - 10https://gerrit.wikimedia.org/r/405019 (https://phabricator.wikimedia.org/T184100)
[17:20:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] install_server: swap restbase2009 disk layout [puppet] - 10https://gerrit.wikimedia.org/r/405019 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi)
[17:21:00] <wikibugs>	 10Operations, 10ops-ulsfo, 10Traffic: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815#3703799 (10Volans) I've run clean + deactivate for cp4018 as part of cleanup of stale puppet certs.
[17:26:06] <ema>	 !log cache_upload: upgrade cp3039 to varnish 5
[17:26:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:21] <wikibugs>	 10Operations, 10ops-ulsfo, 10Traffic, 10netops: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3910263 (10ayounsi)
[17:28:10] <wikibugs>	 (03CR) 10Mobrovac: [C: 04-1] cassandra: create parent data directories with exec (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) (owner: 10Eevans)
[17:29:36] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T184787#3910280 (10fgiunchedi) 05Open>03Resolved Thanks @Papaul ! Disk rebuilding
[17:29:53] <wikibugs>	 (03PS2) 10Gehel: mediawiki: remove unused logging configuration of mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/404980 (https://phabricator.wikimedia.org/T182304)
[17:29:55] <wikibugs>	 (03PS1) 10Ottomata: Add druid defaults for easier setup in Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/405021
[17:31:39] <ema>	 !log cache_upload: repool cp3039 (varnish 5)
[17:31:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:09] <wikibugs>	 (03CR) 10Gehel: [C: 032] mediawiki: remove unused logging configuration of mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/404980 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel)
[17:33:45] <jynus>	 !log starting compare.py on s3 codfw (it triggered db2036 crash before)
[17:33:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:41] <wikibugs>	 (03CR) 10Ottomata: [C: 032] Add druid defaults for easier setup in Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/405021 (owner: 10Ottomata)
[17:35:46] <wikibugs>	 (03PS2) 10Ottomata: Add druid defaults for easier setup in Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/405021
[17:36:50] <wikibugs>	 (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9787/" [puppet] - 10https://gerrit.wikimedia.org/r/405021 (owner: 10Ottomata)
[17:39:40] <moritzm>	 !log rebooting sodium (and temporarily disable icinga-wm due to some expected spam due to clients failing to run apt-get update)
[17:39:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:49] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: Cleanup multiple definitions of logstash endpoint in puppet / hiera - https://phabricator.wikimedia.org/T182304#3910343 (10Gehel) Most references to logstash host are now consolidated in a single variable. There are...
[17:44:47] <moritzm>	 sodium back up and ircecho re-started
[17:46:05] <wikibugs>	 10Operations, 10Puppet, 10Patch-For-Review: Puppet hosts with their cert revoked can still run puppet - https://phabricator.wikimedia.org/T184444#3910365 (10herron) CRL is now being checked by the puppet master apache frontends.    There were a few issues encountered after deployment  # puppetmaster1001 cert...
[17:47:03] <volans>	 moritzm: I guess we'll still get some recovery spam in ~30 min, right?
[17:48:19] <moritzm>	 volans: I expected a load of puppet failures, but I'm not seeing any in "Current Network Status" in icinga, seems to be handled more gracefully than anticipated
[17:48:33] <wikibugs>	 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3910381 (10RStallman-legalteam) Sorry I missed this ping through phabricator. Christian Covington's NDA is signed and on file and I'm updating the NDA spreadshe...
[17:49:16] <volans>	 good :)
[17:49:52] <wikibugs>	 (03PS1) 10Chad: Update my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/405025
[17:51:31] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3910404 (10Imarlier) We should look at the varnish logs to see if we can find other pages that have a similar behavior.  Is this part of T164248
[17:54:59] <wikibugs>	 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3910411 (10RobH)
[17:55:27] <ema>	 !log cache_upload: upgrade cp3044 to varnish 5
[17:55:29] <icinga-wm>	 PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 346 MB (3% inode=61%)
[17:55:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:02] <wikibugs>	 (03PS2) 10RobH: adding shell user cy534 [puppet] - 10https://gerrit.wikimedia.org/r/403088 (https://phabricator.wikimedia.org/T184473)
[17:58:05] <wikibugs>	 (03CR) 10RobH: [C: 032] adding shell user cy534 [puppet] - 10https://gerrit.wikimedia.org/r/403088 (https://phabricator.wikimedia.org/T184473) (owner: 10RobH)
[17:58:25] <wikibugs>	 10Operations, 10MediaWiki-JobQueue, 10Beta-Cluster-reproducible, 10Performance-Team (Radar): Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3910419 (10Imarlier)
[17:59:04] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10MediaWiki-JobQueue, 10Beta-Cluster-reproducible, 10Performance-Team (Radar): Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3904733 (10Imarlier) `monospaced text`
[18:00:04] <jouncebot>	 cscott, arlolra, subbu, halfak, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180118T1800).
[18:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[18:00:17] <ema>	 !log cache_upload: repool cp3044 (varnish 5)
[18:00:26] <wikibugs>	 (03PS2) 10RobH: adds cy534 to analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/403090 (https://phabricator.wikimedia.org/T184473)
[18:00:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:36] <awight>	 ORES isn’t scheduled to cause surprise downtime today.
[18:01:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Ensure all packages are updated when d-i installs security updates [puppet] - 10https://gerrit.wikimedia.org/r/405026
[18:02:17] <wikibugs>	 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910431 (10Reedy)
[18:02:42] <wikibugs>	 (03Draft2) 10Jayprakash12345: Add "Portal" namespace on it.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405024 (https://phabricator.wikimedia.org/T185232)
[18:04:25] <matthiasmullie>	 anyone doing services deploy? can I join in for a scap deploy of /srv/3d2png?
[18:06:25] <wikibugs>	 (03PS2) 10Chad: Update my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/405025
[18:07:17] <wikibugs>	 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910454 (10MoritzMuehlenhoff) Ops  and Releng are using pwstore, which is just using a simple git repository underneath for storage.  The repo for Ops is stored on a private host and IIRC R...
[18:11:57] <wikibugs>	 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910431 (10demon) We did use a private repo in Phabricator. It's worked great.
[18:12:38] <matthiasmullie>	 alright - would I be disrupting anyone/-thing if I were to scap deploy /srv/3d2png now (or any other reason not to do so - first time updating service so I don't know the process for these things all too well :))
[18:13:09] <wikibugs>	 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910492 (10Reedy) Something like that sounds fine, yeah.  Be nice to do something that isn't wheel re-inventing obviously, so if we can follow what releng do... That works for me :)
[18:14:25] <wikibugs>	 (03PS3) 10Chad: Update my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/405025
[18:17:03] <matthiasmullie>	 greg-g ^ - would now be a good time to update 3d2png service? (or is there some process I should go through)
[18:17:32] <greg-g>	 matthiasmullie: yup!
[18:17:59] <greg-g>	 matthiasmullie: just !log what you're doing :) (scap does it for you if you give it a message)
[18:18:42] <wikibugs>	 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910505 (10MoritzMuehlenhoff) Some general docs (targeted at ops pwstore, but pretty similar) are at https://office.wikimedia.org/wiki/Pwstore
[18:19:02] <matthiasmullie>	 alright thanks
[18:22:02] <wikibugs>	 (03PS4) 10RobH: Update my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/405025 (owner: 10Chad)
[18:22:10] <wikibugs>	 (03PS1) 10Sbisson: Hide Flow beta feature everywhere but testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405034 (https://phabricator.wikimedia.org/T184670)
[18:22:25] <logmsgbot>	 !log mlitn@tin Started deploy [3d2png/deploy@74b1ed7]: Updating 3d2png repo
[18:22:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:15] <logmsgbot>	 !log mlitn@tin Finished deploy [3d2png/deploy@74b1ed7]: Updating 3d2png repo (duration: 00m 50s)
[18:23:22] <wikibugs>	 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910543 (10demon) Our process in Releng is identical, save the "where to clone from" bit.
[18:23:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:46] <wikibugs>	 (03CR) 10RobH: [C: 032] Update my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/405025 (owner: 10Chad)
[18:25:04] <wikibugs>	 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910431 (10greg) Our repo (https://phabricator.wikimedia.org/source/releng-secrets/repository/master/ ) haha you can't see it! ;) but yeah, it's straight-forward to make private repos in Ph...
[18:28:53] <wikibugs>	 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910552 (10Reedy) Cool. I'll check in with Brian/John that they're ok with doing it like this before we go ahead and actually create anything
[18:31:32] <icinga-wm>	 PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/demon/.pylintrc]
[18:33:20] <wikibugs>	 10Operations, 10Analytics-Kanban, 10Traffic, 10User-Elukey: Refactor kafka_config.rb and and kafka_cluster_name.rb in puppet to avoid explicit hiera calls - https://phabricator.wikimedia.org/T177927#3910563 (10fdans)
[18:33:22] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 43.51, 36.76, 32.84
[18:33:28] <matt_flaschen>	 greg-g, I'm doing an emergency deploy for https://phabricator.wikimedia.org/T184670 (SD opt-in/opt-out breakage).  We may also have to run a maintenance script later.  CC stephanebisson
[18:33:54] <ema>	 !log cache_upload: upgrade cp3045 to varnish 5
[18:34:01] <matt_flaschen>	 greg-g, first deployed change just disables the Beta Feature on non-talk.  Existing users will keep their current discussion page, but they won't be able to turn it on or off.
[18:34:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:27] <wikibugs>	 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910565 (10Reedy) p:05Triage>03Normal
[18:35:00] <greg-g>	 matt_flaschen: yuck, godspeed.
[18:35:04] <greg-g>	 cc thcipriani ^
[18:35:35] <thcipriani>	 ack
[18:37:36] <wikibugs>	 10Operations, 10Analytics, 10Traffic, 10User-Elukey: Refactor kafka_config.rb and and kafka_cluster_name.rb in puppet to avoid explicit hiera calls - https://phabricator.wikimedia.org/T177927#3910569 (10fdans)
[18:39:12] <ema>	 !log cache_upload: repool cp3045 (varnish 5)
[18:39:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:32] <logmsgbot>	 !log bsitzmann@tin Started deploy [mobileapps/deploy@669fb5b]: Update mobileapps to 2690899 (T184328 T184557 T177007 T184669 T177430 T185050)
[18:40:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:48] <stashbot>	 T184328: Add revision + tid to references output - https://phabricator.wikimedia.org/T184328
[18:40:48] <stashbot>	 T177007: Should we flatten spans in summary output - https://phabricator.wikimedia.org/T177007
[18:40:49] <stashbot>	 T177430: Develop a Media JSON API - https://phabricator.wikimedia.org/T177430
[18:40:49] <stashbot>	 T184669: Fetching incorrect description data from the Feed API - https://phabricator.wikimedia.org/T184669
[18:40:49] <stashbot>	 T185050: Run comparison of html extracts again - https://phabricator.wikimedia.org/T185050
[18:41:09] <wikibugs>	 10Operations, 10Analytics-Kanban, 10User-Elukey: Tune Varnishkafka delivery errors to be more sensitive - https://phabricator.wikimedia.org/T173492#3910612 (10fdans)
[18:41:11] <wikibugs>	 10Operations, 10Puppet: Puppet hosts with signed certificate present on agent but not master - https://phabricator.wikimedia.org/T185239#3910613 (10herron) p:05Triage>03Normal
[18:42:02] <wikibugs>	 (03CR) 10Mattflaschen: [C: 032] Hide Flow beta feature everywhere but testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405034 (https://phabricator.wikimedia.org/T184670) (owner: 10Sbisson)
[18:44:11] <wikibugs>	 (03Merged) 10jenkins-bot: Hide Flow beta feature everywhere but testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405034 (https://phabricator.wikimedia.org/T184670) (owner: 10Sbisson)
[18:44:44] <wikibugs>	 10Operations, 10Puppet: Puppet hosts with signed certificate present on agent but not master - https://phabricator.wikimedia.org/T185239#3910632 (10herron)
[18:45:29] <wikibugs>	 10Operations, 10procurement: Give access to S4 (procurement tasks) to Deb Tankersley - https://phabricator.wikimedia.org/T185240#3910634 (10Gehel)
[18:47:07] <wikibugs>	 (03CR) 10jenkins-bot: Hide Flow beta feature everywhere but testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405034 (https://phabricator.wikimedia.org/T184670) (owner: 10Sbisson)
[18:47:35] <logmsgbot>	 !log bsitzmann@tin Finished deploy [mobileapps/deploy@669fb5b]: Update mobileapps to 2690899 (T184328 T184557 T177007 T184669 T177430 T185050) (duration: 07m 03s)
[18:47:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:52] <stashbot>	 T184328: Add revision + tid to references output - https://phabricator.wikimedia.org/T184328
[18:47:52] <stashbot>	 T177007: Should we flatten spans in summary output - https://phabricator.wikimedia.org/T177007
[18:47:52] <stashbot>	 T177430: Develop a Media JSON API - https://phabricator.wikimedia.org/T177430
[18:47:53] <stashbot>	 T184669: Fetching incorrect description data from the Feed API - https://phabricator.wikimedia.org/T184669
[18:47:53] <stashbot>	 T185050: Run comparison of html extracts again - https://phabricator.wikimedia.org/T185050
[18:49:11] <wikibugs>	 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910664 (10Bawolff) Do we actually have shared secrets?  I'm fine with the private repo approach (the repo is secret and then pgp on top of that, right?)
[18:51:49] <wikibugs>	 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910667 (10Reedy) >>! In T185236#3910664, @Bawolff wrote: > Do we actually have shared secrets?  Not really at the moment, but I'd imagine there's stuff other teams have that we should prob...
[18:52:47] <wikibugs>	 10Operations, 10Analytics-Kanban, 10monitoring, 10netops, 10User-Elukey: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3910670 (10elukey) Had an interesting chat with @ayounsi and for the moment it seems that the only format expected in the ne...
[18:54:44] <ema>	 !log cache_upload: upgrade cp3046 to varnish 5
[18:54:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:32] <icinga-wm>	 RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[18:58:29] <logmsgbot>	 !log mattflaschen@tin Synchronized wmf-config/InitialiseSettings.php: T184670: Hide Flow beta feature everywhere but testwiki (duration: 01m 10s)
[18:58:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:44] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200)
[18:58:44] <stashbot>	 T184670: [wmf.16-regression]  Fatal exception of type "Flow\Exception\InvalidDataException"  for opting out from "Structured Discussions on user talk" - https://phabricator.wikimedia.org/T184670
[18:58:52] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200)
[18:58:52] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200)
[18:59:03] <wikibugs>	 10Operations, 10Patch-For-Review: Decommission host erbium - https://phabricator.wikimedia.org/T185226#3910697 (10Aklapper) @Framawiki: The one you can already find in the patch.
[18:59:13] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200)
[18:59:13] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200)
[18:59:23] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200)
[19:00:04] <jouncebot>	 addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180118T1900).
[19:00:04] <jouncebot>	 subbu, Trey314159, and edsanders: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[19:00:11] <ema>	 !log cache_upload: repool cp3046 (varnish 5)
[19:00:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:13] <ema>	 joal: problems with aqs?
[19:02:15] <matt_flaschen>	 stephanebisson, done and confirmed.  It should be fine, but in the unlikely event it makes things worse, morning SWAT is now.  Ping me on Hangouts if anything goes wrong.  I will try to run the dry-run before our meeting.
[19:02:30] <joal>	 ema: Thanks for pinging, I wouldn't have noticed :( No real problem, will silence the alarm
[19:02:47] <ema>	 alright! :)
[19:04:25] <thcipriani>	 I can SWAT
[19:04:57] <wikibugs>	 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910718 (10greg) >>! In T185236#3910664, @Bawolff wrote: > I'm fine with the private repo approach (the repo is secret and then pgp on top of that, right?)  Right.
[19:05:12] <thcipriani>	 subbu|lunch: Trey314159 edsanders ping for SWAT if you all are around.
[19:05:26] <Trey314159>	 thcipriani: ack
[19:05:52] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy
[19:05:52] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy
[19:05:53] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy
[19:05:58] <joal>	 ema: Done --^
[19:06:04] <joal>	 ema: Thanks again !
[19:06:13] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy
[19:06:13] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy
[19:06:23] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy
[19:06:48] <subbu|lunch>	 ack
[19:07:36] <wikibugs>	 10Operations, 10Puppet, 10Traffic: Puppet hosts with signed certificate present on agent but not master - https://phabricator.wikimedia.org/T185239#3910723 (10ema)
[19:08:00] <logmsgbot>	 !log arlolra@tin Started deploy [parsoid/deploy@fcc2b63]: Updating Parsoid to af06386
[19:08:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:18] <wikibugs>	 (03CR) 10RobH: [C: 032] adds cy534 to analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/403090 (https://phabricator.wikimedia.org/T184473) (owner: 10RobH)
[19:08:23] <wikibugs>	 (03PS3) 10RobH: adds cy534 to analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/403090 (https://phabricator.wikimedia.org/T184473)
[19:08:56] <no_justification>	 Hmm, cobalt isn't responding on :22 to me :\
[19:10:32] <icinga-wm>	 PROBLEM - Disk space on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:10:33] <icinga-wm>	 PROBLEM - swift-container-auditor on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:10:33] <icinga-wm>	 PROBLEM - swift-account-reaper on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:10:42] <icinga-wm>	 PROBLEM - DPKG on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:10:50] <no_justification>	 Nvm, user error
[19:10:53] <icinga-wm>	 PROBLEM - swift-account-replicator on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:10:54] <icinga-wm>	 PROBLEM - swift-container-server on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:11:02] <icinga-wm>	 PROBLEM - swift-account-server on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:11:02] <wikibugs>	 (03PS2) 10Thcipriani: Update linter stats for commonswiki less frequently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404233 (https://phabricator.wikimedia.org/T184280) (owner: 10Subramanya Sastry)
[19:11:07] <wikibugs>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404233 (https://phabricator.wikimedia.org/T184280) (owner: 10Subramanya Sastry)
[19:11:13] <icinga-wm>	 PROBLEM - configured eth on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:11:22] <icinga-wm>	 PROBLEM - swift-object-updater on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:11:23] <icinga-wm>	 PROBLEM - swift-container-updater on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:11:32] <icinga-wm>	 PROBLEM - swift-object-auditor on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:11:42] <icinga-wm>	 PROBLEM - swift-object-server on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:12:12] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:12:12] <icinga-wm>	 PROBLEM - Check size of conntrack table on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:12:22] <icinga-wm>	 PROBLEM - dhclient process on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:13:13] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:13:22] <icinga-wm>	 PROBLEM - MD RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:13:23] <icinga-wm>	 PROBLEM - swift-container-updater on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:13:25] <wikibugs>	 10Operations, 10Ops-Access-Requests: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3910743 (10RobH) a:05cy534>03None
[19:13:50] <wikibugs>	 10Operations, 10Ops-Access-Requests: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3884118 (10RobH) All access patches merged.   It can take up to 30 minutes for affected hosts to receive the update.  If you have any questions or issues, please feel free to reopen...
[19:13:52] <icinga-wm>	 PROBLEM - puppet last run on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:13:57] <wikibugs>	 10Operations, 10Ops-Access-Requests: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3910748 (10RobH) 05Open>03Resolved a:03RobH
[19:14:13] <icinga-wm>	 PROBLEM - Check size of conntrack table on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:15:03] <icinga-wm>	 PROBLEM - swift-account-auditor on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:15:03] <icinga-wm>	 PROBLEM - swift-account-server on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:15:13] <icinga-wm>	 PROBLEM - configured eth on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:15:22] <icinga-wm>	 RECOVERY - swift-object-updater on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[19:15:22] <icinga-wm>	 RECOVERY - dhclient process on ms-be2023 is OK: PROCS OK: 0 processes with command name dhclient
[19:15:22] <icinga-wm>	 RECOVERY - swift-container-updater on ms-be2023 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[19:15:23] <icinga-wm>	 RECOVERY - Disk space on ms-be2023 is OK: DISK OK
[19:15:23] <icinga-wm>	 RECOVERY - swift-object-auditor on ms-be2023 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[19:15:32] <icinga-wm>	 RECOVERY - swift-account-reaper on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[19:15:32] <icinga-wm>	 RECOVERY - swift-container-auditor on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[19:15:32] <icinga-wm>	 RECOVERY - DPKG on ms-be2023 is OK: All packages OK
[19:15:32] <icinga-wm>	 RECOVERY - swift-object-server on ms-be2023 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[19:15:52] <icinga-wm>	 RECOVERY - swift-container-server on ms-be2023 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[19:15:53] <icinga-wm>	 RECOVERY - swift-account-replicator on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[19:15:53] <icinga-wm>	 RECOVERY - swift-account-auditor on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[19:15:53] <icinga-wm>	 RECOVERY - swift-account-server on ms-be2023 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[19:15:55] <wikibugs>	 (03Merged) 10jenkins-bot: Update linter stats for commonswiki less frequently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404233 (https://phabricator.wikimedia.org/T184280) (owner: 10Subramanya Sastry)
[19:16:03] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be2023 is OK: OK - load average: 37.94, 40.40, 36.91
[19:16:03] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2023 is OK: OK ferm input default policy is set
[19:16:12] <icinga-wm>	 RECOVERY - configured eth on ms-be2023 is OK: OK - interfaces up
[19:16:13] <icinga-wm>	 RECOVERY - Check size of conntrack table on ms-be2023 is OK: OK: nf_conntrack is 3 % full
[19:16:13] <icinga-wm>	 RECOVERY - MD RAID on ms-be2023 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:16:51] <thcipriani>	 subbu: your change is on mwdebug1002, anything to check there?
[19:17:00] <subbu>	 not really.
[19:17:01] <wikibugs>	 (03CR) 10jenkins-bot: Update linter stats for commonswiki less frequently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404233 (https://phabricator.wikimedia.org/T184280) (owner: 10Subramanya Sastry)
[19:17:19] <subbu>	 you can proceed. i'll watch db stats in a few hours to see if the failures have gone down.
[19:17:32] <logmsgbot>	 !log arlolra@tin Finished deploy [parsoid/deploy@fcc2b63]: Updating Parsoid to af06386 (duration: 09m 32s)
[19:17:39] <thcipriani>	 subbu: kk, doing.
[19:17:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:06] <wikibugs>	 10Operations, 10procurement: Give access to S4 (procurement tasks) to Deb Tankersley - https://phabricator.wikimedia.org/T185240#3910766 (10RobH) 05Open>03Resolved For those following along, this is specifically for acl*procurement-review which is a group that I maintain.  I've reviewed and added @debt as...
[19:18:22] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:18:42] <icinga-wm>	 RECOVERY - puppet last run on ms-be2023 is OK: OK: Puppet is currently enabled, last run 14 minutes ago with 0 failures
[19:19:42] <wikibugs>	 (03PS9) 10Thcipriani: Updates to enable transliteration for crhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396282 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones)
[19:20:06] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:404233|Update linter stats for commonswiki less frequently]] T184280 (duration: 01m 13s)
[19:20:13] <thcipriani>	 subbu: ^ live
[19:20:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:17] <stashbot>	 T184280: Linter multiple database issues - https://phabricator.wikimedia.org/T184280
[19:20:33] <subbu>	 ty
[19:20:42] <ema>	 !log cache_upload: upgrade cp3049 to varnish 5
[19:20:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:16] <wikibugs>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396282 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones)
[19:24:15] <arlolra>	 !log Updated Parsoid to af06386 (T45094)
[19:24:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:28] <stashbot>	 T45094: Parsoid: References should be wrapped in a <sup>, not a <span> - https://phabricator.wikimedia.org/T45094
[19:24:45] <wikibugs>	 (03CR) 10Dzahn: "thanks for merging, and that ticket. glad it wasn't just me, didn't get why :)" [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn)
[19:25:44] <wikibugs>	 10Operations, 10procurement: Give access to S4 (procurement tasks) to Deb Tankersley - https://phabricator.wikimedia.org/T185240#3910846 (10debt) Thanks for the add, @RobH, and I understand. :)
[19:25:57] <wikibugs>	 (03Merged) 10jenkins-bot: Updates to enable transliteration for crhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396282 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones)
[19:26:32] <thcipriani>	 Trey314159: ^ is live on mwdebug1002, check please
[19:26:38] <Trey314159>	 will do
[19:27:07] <wikibugs>	 (03CR) 10jenkins-bot: Updates to enable transliteration for crhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396282 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones)
[19:28:38] <Trey314159>	 thcipriani: the transliteration menu showed up, but clicking it is giving errors. Test failed. :(
[19:28:59] * subbu realizes what the 314159 in Trey314159 is about after seeing it so many times and not paying attention :)
[19:29:10] <wikibugs>	 (03PS5) 10Eevans: cassandra: create parent data directories with exec [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284)
[19:29:31] <thcipriani>	 Trey314159: ok, should I go ahead and revert or is there any other testing you need to do?
[19:30:03] <Trey314159>	 thcipriani: can you give me a minute to look around and then revert?
[19:30:37] <thcipriani>	 yep, I'll make the revert but leave it up on mwdebug1002 for a few.
[19:31:51] <edsanders>	 sorry, I'm here if needed
[19:32:22] <thcipriani>	 right on time, was just about to merge your patch :)
[19:32:23] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:35:42] <wikibugs>	 (03PS1) 10Thcipriani: Revert "Updates to enable transliteration for crhwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405043
[19:37:45] <wikibugs>	 (03CR) 10Eevans: cassandra: create parent data directories with exec (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) (owner: 10Eevans)
[19:38:05] <Trey314159>	 thcipriani: thanks. I'm not sure what happened, but I got some data for looking into it.
[19:38:09] <wikibugs>	 (03CR) 10Eevans: "Updated [PC output](http://puppet-compiler.wmflabs.org/9788/)" [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) (owner: 10Eevans)
[19:38:18] <thcipriani>	 Trey314159: sure thing, glad it was useful :)
[19:38:29] <wikibugs>	 (03CR) 10Thcipriani: [C: 032] Revert "Updates to enable transliteration for crhwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405043 (owner: 10Thcipriani)
[19:40:37] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Updates to enable transliteration for crhwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405043 (owner: 10Thcipriani)
[19:40:47] <wikibugs>	 (03CR) 10jenkins-bot: Revert "Updates to enable transliteration for crhwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405043 (owner: 10Thcipriani)
[19:41:01] <wikibugs>	 (03PS3) 10Dzahn: releases: Sync security patches for MW from deployment to nightlies server [puppet] - 10https://gerrit.wikimedia.org/r/404892 (owner: 10Chad)
[19:43:20] <thcipriani>	 edsanders: your change is live on mwdebug1002, check please
[19:43:52] <icinga-wm>	 PROBLEM - configured eth on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:43:53] <icinga-wm>	 PROBLEM - DPKG on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:44:02] <icinga-wm>	 PROBLEM - dhclient process on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:44:03] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:44:22] <icinga-wm>	 PROBLEM - Check size of conntrack table on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:44:52] <icinga-wm>	 RECOVERY - configured eth on pybal-test2001 is OK: OK - interfaces up
[19:44:53] <icinga-wm>	 RECOVERY - DPKG on pybal-test2001 is OK: All packages OK
[19:45:02] <icinga-wm>	 RECOVERY - dhclient process on pybal-test2001 is OK: PROCS OK: 0 processes with command name dhclient
[19:45:03] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on pybal-test2001 is OK: OK ferm input default policy is set
[19:45:09] <edsanders>	 thcipriani: how do I do that :)
[19:45:22] <icinga-wm>	 RECOVERY - Check size of conntrack table on pybal-test2001 is OK: OK: nf_conntrack is 0 % full
[19:45:39] <wikibugs>	 (PS4) Dzahn: releases: Sync security patches for MW from deployment to nightlies server [puppet] - https://gerrit.wikimedia.org/r/404892 (owner: Chad)
[19:45:46] * thcipriani digs for the documentation
[19:46:16] <thcipriani>	 edsanders: https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug
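For readers following along: checking a change on a debug host comes down to sending the X-Wikimedia-Debug request header so the request is routed to that backend. A minimal sketch in Python follows. The header value format (`backend=<host>`), the fully qualified host name, and the test URL are assumptions made for illustration, not taken from this log; the linked wikitech page is the authoritative reference.

```python
from urllib.request import Request

# Header name comes from the wikitech page linked above.
DEBUG_HEADER = "X-Wikimedia-Debug"


def debug_headers(backend):
    """Return headers asking for a specific debug backend.

    The 'backend=<host>' value format is an assumption for this sketch.
    """
    return {DEBUG_HEADER: "backend=" + backend}


def build_debug_request(url, backend):
    """Build (but do not send) a GET request routed to the debug backend."""
    return Request(url, headers=debug_headers(backend))


# Hypothetical host FQDN and URL, chosen only for illustration.
req = build_debug_request(
    "https://en.wikipedia.org/wiki/Special:Version",
    "mwdebug1002.eqiad.wmnet",
)
```

Passing `req` to `urllib.request.urlopen` would actually send it; a browser extension or `curl -H` with the same header is the more common way to test by hand.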
[19:46:37] <wikibugs>	 (PS1) Tjones: Revert "Updates to enable short URLs for transliteration for crhwiki - beta" [puppet] - https://gerrit.wikimedia.org/r/405048
[19:47:19] <wikibugs>	 (PS1) Kaldari: Removing unused citizendium from $wgRelatedSitesPrefixes [mediawiki-config] - https://gerrit.wikimedia.org/r/405049 (https://phabricator.wikimedia.org/T185246)
[19:48:46] <wikibugs>	 (CR) Tjones: "The deployment of the transliteration in mediawiki-config did not work, so we should undo this config until we know what's going on." [puppet] - https://gerrit.wikimedia.org/r/405048 (owner: Tjones)
[19:49:34] <wikibugs>	 (PS2) Kaldari: Removing unused citizendium from $wgRelatedSitesPrefixes... [mediawiki-config] - https://gerrit.wikimedia.org/r/405049 (https://phabricator.wikimedia.org/T185246)
[19:50:05] <edsanders>	 thcipriani: looks good to me
[19:50:27] <thcipriani>	 okie doke, pushing out everywhere
[19:51:20] <wikibugs>	 (CR) Dzahn: [C: 2] releases: Sync security patches for MW from deployment to nightlies server [puppet] - https://gerrit.wikimedia.org/r/404892 (owner: Chad)
[19:53:23] <logmsgbot>	 !log thcipriani@tin Synchronized php-1.31.0-wmf.17/extensions/VisualEditor/modules/ve-mw/ui/pages/ve.ui.MWTemplatePlaceholderPage.js: SWAT: [[gerrit:405028|Update TitleInput getTitle to getMWTitle]] (duration: 01m 09s)
[19:53:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:36] <thcipriani>	 ^ edsanders live everywhere now, thanks!
[19:53:56] <edsanders>	 thanks
[19:54:19] <logmsgbot>	 !log arlolra@tin Started deploy [parsoid/deploy@8736b8c]: Updating Parsoid config
[19:54:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:37] <wikibugs>	 (CR) Dzahn: "tin: created rsyncd config  releases1001: created rsync cron to pull  naos: nothing   releases2001: nothing    good!" [puppet] - https://gerrit.wikimedia.org/r/404892 (owner: Chad)
[19:56:21] <logmsgbot>	 !log arlolra@tin Finished deploy [parsoid/deploy@8736b8c]: Updating Parsoid config (duration: 02m 01s)
[19:56:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:56:45] <wikibugs>	 (CR) Dzahn: "follow-up needed, coming up in a minute" [puppet] - https://gerrit.wikimedia.org/r/404892 (owner: Chad)
[19:59:15] <logmsgbot>	 !log arlolra@tin Started deploy [parsoid/deploy@8736b8c]: (no justification provided)
[19:59:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:00] <logmsgbot>	 !log arlolra@tin Finished deploy [parsoid/deploy@8736b8c]: (no justification provided) (duration: 00m 44s)
[20:00:04] <jouncebot>	 thcipriani: How many deployers does it take to do MediaWiki train deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180118T2000).
[20:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[20:00:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:12] <logmsgbot>	 !log arlolra@tin Started deploy [parsoid/deploy@8736b8c]: (no justification provided)
[20:00:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:28] <thcipriani>	 working on it, jouncebot. working on it...
[20:01:21] <logmsgbot>	 !log arlolra@tin Finished deploy [parsoid/deploy@8736b8c]: (no justification provided) (duration: 01m 09s)
[20:01:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:32] <icinga-wm>	 PROBLEM - SSH on ms-be2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:02:42] <icinga-wm>	 PROBLEM - MD RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:02:42] <icinga-wm>	 PROBLEM - dhclient process on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:04:22] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2023 is CRITICAL: CRITICAL - load average: 180.95, 108.02, 64.29
[20:04:42] <icinga-wm>	 PROBLEM - dhclient process on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:05:58] <wikibugs>	 (PS1) Dzahn: releases: ensure /srv/patches directory exists [puppet] - https://gerrit.wikimedia.org/r/405052
[20:06:22] <icinga-wm>	 RECOVERY - SSH on ms-be2023 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u1 (protocol 2.0)
[20:06:33] <icinga-wm>	 RECOVERY - dhclient process on ms-be2023 is OK: PROCS OK: 0 processes with command name dhclient
[20:06:33] <icinga-wm>	 RECOVERY - MD RAID on ms-be2023 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[20:07:00] <wikibugs>	 (CR) Dzahn: [C: 2] releases: ensure /srv/patches directory exists [puppet] - https://gerrit.wikimedia.org/r/405052 (owner: Dzahn)
[20:07:11] <wikibugs>	 (CR) Dzahn: "https://gerrit.wikimedia.org/r/#/c/405052/" [puppet] - https://gerrit.wikimedia.org/r/404892 (owner: Chad)
[20:09:22] <logmsgbot>	 !log thcipriani@tin Synchronized php-1.31.0-wmf.17/extensions/Score/includes/Score.php: SWAT: [[gerrit:405029|Always pass FileBackend instance to `new FileRepo()`]] T185204 (duration: 01m 12s)
[20:09:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:35] <stashbot>	 T185204: Score causes InvalidArgumentException from line 183 of /srv/mediawiki/php-1.31.0-wmf.17/includes/filebackend/FileBackendGroup.php: No backend defined with the name ``. - https://phabricator.wikimedia.org/T185204
[20:09:49] <mutante>	 !log releases1001 - /srv/patches got created, initial manual rsync using /usr/local/sbin/sync-srv-patches created by rsync::quickdatacopy, mw patches exists on nightlies server now
[20:09:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:23] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be2023 is OK: OK - load average: 32.84, 79.74, 70.31
[20:17:24] <ejegg>	 Hi all!
[20:18:03] <ejegg>	 If I want to tunnel in to prometheus1001 to use the prometheus web ui directly, which credentials should I use?
[20:18:26] <ejegg>	 Or how should I request access?
[20:20:16] <no_justification>	 paladox: Hmmm https://phabricator.wikimedia.org/P6616
[20:20:26] <no_justification>	 I wonder if my postdata is busted
[20:20:31] <paladox>	 requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://gerrit.wikimedia.org/r/a/projects/mediawiki%2Fcore/branches/test
[20:20:44] <paladox>	 that branch exists
[20:20:46] <paladox>	 so hmm
[20:21:30] <no_justification>	 Duhhhhhhhhh
[20:21:38] <no_justification>	 I'm a dumbass
[20:21:38] <paladox>	 no_justification i wonder are you trying to fetch the branch or create it?
[20:21:46] <no_justification>	 Data should just be a dict.
[20:21:50] <no_justification>	 Not fake JSON
[20:21:51] <paladox>	 oh
[20:21:54] <wikibugs>	 (PS1) Thcipriani: All wikis to 1.31.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/405055
[20:21:57] * no_justification smacks self
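The "dict, not fake JSON" point above can be sketched as follows. The actual script lives in the paste linked earlier and is not shown in this log, so this is only an illustration under assumptions: the `revision` field name and the helper names are hypothetical stand-ins, and the hand-written string is one plausible shape of "fake JSON".

```python
import json


def branch_payload(revision):
    """Return the request body as a plain dict (field name assumed for illustration)."""
    return {"revision": revision}


def is_valid_json(text):
    """Report whether text parses as JSON."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


# A hand-assembled string that merely looks like JSON: single quotes are not valid.
fake = "{'revision': 'master'}"

# Serializing the dict with the json module always yields valid JSON.
real = json.dumps(branch_payload("master"))
```

With the `requests` library, the dict can be handed over directly via the `json=` keyword (for example `requests.put(url, json=branch_payload("master"))`), which serializes it and sets the JSON content type, so no string needs to be built by hand.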
[20:22:39] <paladox>	 (Note that /projects/ is being renamed to /repo<ository or s>/ :))
[20:22:41] <thcipriani>	 "scappish" :)
[20:23:43] <wikibugs>	 (CR) Andrew Bogott: rabbitmq: handling users and initial setup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/403202 (owner: Rush)
[20:23:52] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:23:55] <no_justification>	 thcipriani: Where I put scap-related things!
[20:24:00] <no_justification>	 Scap-ish things!
[20:24:01] <no_justification>	 :p
[20:24:04] <thcipriani>	 makes sense
[20:24:09] <thcipriani>	 :)
[20:24:31] <no_justification>	 https://gerrit.wikimedia.org/r/405056 should fix things
[20:24:35] <wikibugs>	 (CR) Thcipriani: [C: 2] All wikis to 1.31.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/405055 (owner: Thcipriani)
[20:24:53] <icinga-wm>	 PROBLEM - Disk space on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:25:03] <icinga-wm>	 PROBLEM - swift-container-auditor on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:25:03] <icinga-wm>	 PROBLEM - swift-object-auditor on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:25:32] <icinga-wm>	 PROBLEM - swift-account-replicator on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:25:32] <icinga-wm>	 PROBLEM - swift-container-server on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:25:42] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:25:42] <icinga-wm>	 PROBLEM - configured eth on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:25:52] <icinga-wm>	 PROBLEM - swift-object-updater on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:25:52] <icinga-wm>	 PROBLEM - dhclient process on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:25:52] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:26:12] <wikibugs>	 (Merged) jenkins-bot: All wikis to 1.31.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/405055 (owner: Thcipriani)
[20:26:12] <icinga-wm>	 PROBLEM - swift-account-reaper on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:26:33] <icinga-wm>	 PROBLEM - swift-account-auditor on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:26:33] <icinga-wm>	 PROBLEM - Check size of conntrack table on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:27:11] <wikibugs>	 (CR) jenkins-bot: All wikis to 1.31.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/405055 (owner: Thcipriani)
[20:27:53] <icinga-wm>	 PROBLEM - puppet last run on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:27:53] <icinga-wm>	 PROBLEM - swift-container-updater on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:28:12] <icinga-wm>	 PROBLEM - swift-object-server on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:28:12] <icinga-wm>	 PROBLEM - DPKG on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:28:21] <no_justification>	 ms-be2023 seems unhappy
[20:28:39] <paladox>	 godog ^^.
[20:28:52] <icinga-wm>	 PROBLEM - dhclient process on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:29:02] <icinga-wm>	 PROBLEM - Disk space on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:29:43] <icinga-wm>	 PROBLEM - swift-container-replicator on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:29:43] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:29:52] <icinga-wm>	 PROBLEM - swift-object-updater on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:29:52] <icinga-wm>	 PROBLEM - MD RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:30:12] <icinga-wm>	 PROBLEM - swift-object-server on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:30:12] <icinga-wm>	 PROBLEM - DPKG on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:30:42] <icinga-wm>	 PROBLEM - swift-account-replicator on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:30:54] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:31:11] <mutante>	 it will recover soon
[20:31:17] <logmsgbot>	 !log thcipriani@tin rebuilt and synchronized wikiversions files: All wikis to 1.31.0-wmf.17
[20:31:19] <mutante>	 it's syncing
[20:31:25] <mutante>	 and really busy, but not dead
[20:31:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:53] <icinga-wm>	 PROBLEM - MD RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:32:32] <icinga-wm>	 RECOVERY - swift-container-replicator on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[20:32:32] <icinga-wm>	 RECOVERY - swift-account-replicator on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[20:32:32] <icinga-wm>	 RECOVERY - swift-container-server on ms-be2023 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[20:32:33] <icinga-wm>	 RECOVERY - Check size of conntrack table on ms-be2023 is OK: OK: nf_conntrack is 4 % full
[20:32:33] <icinga-wm>	 RECOVERY - swift-account-auditor on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[20:32:33] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2023 is OK: OK ferm input default policy is set
[20:32:33] <icinga-wm>	 RECOVERY - configured eth on ms-be2023 is OK: OK - interfaces up
[20:32:43] <icinga-wm>	 RECOVERY - puppet last run on ms-be2023 is OK: OK: Puppet is currently enabled, last run 26 minutes ago with 0 failures
[20:32:43] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:32:43] <icinga-wm>	 RECOVERY - dhclient process on ms-be2023 is OK: PROCS OK: 0 processes with command name dhclient
[20:32:43] <icinga-wm>	 RECOVERY - swift-object-updater on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[20:32:43] <icinga-wm>	 RECOVERY - MD RAID on ms-be2023 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[20:32:52] <icinga-wm>	 RECOVERY - Disk space on ms-be2023 is OK: DISK OK
[20:32:53] <icinga-wm>	 RECOVERY - swift-container-updater on ms-be2023 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[20:33:02] <icinga-wm>	 RECOVERY - swift-account-reaper on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[20:33:02] <icinga-wm>	 RECOVERY - DPKG on ms-be2023 is OK: All packages OK
[20:33:02] <icinga-wm>	 RECOVERY - swift-container-auditor on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[20:33:02] <icinga-wm>	 RECOVERY - swift-object-server on ms-be2023 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[20:33:02] <icinga-wm>	 RECOVERY - swift-object-auditor on ms-be2023 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[20:33:22] <wikibugs>	 (PS1) Phantom42: mediawiki: Better error page layout on mobile devices [puppet] - https://gerrit.wikimedia.org/r/405058 (https://phabricator.wikimedia.org/T182247)
[20:36:23] <icinga-wm>	 PROBLEM - puppet last run on graphite1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:40:22] <icinga-wm>	 PROBLEM - carbon-cache@h service on graphite1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:41:22] <icinga-wm>	 RECOVERY - carbon-cache@h service on graphite1003 is OK: OK - carbon-cache@h is active
[20:41:33] <wikibugs>	 (PS2) Awight: [DO NOT MERGE] Update ORES venv path to use versioned cache [puppet] - https://gerrit.wikimedia.org/r/392683 (https://phabricator.wikimedia.org/T181071)
[20:43:49] <wikibugs>	 (CR) Gehel: "@Tjones: could you add the reason for the revert in the commit message? Thanks!" [puppet] - https://gerrit.wikimedia.org/r/405048 (owner: Tjones)
[20:53:00] <wikibugs>	 Operations, hardware-requests, Patch-For-Review: Decommission host erbium - https://phabricator.wikimedia.org/T185226#3911110 (Peachey88)
[20:53:05] <wikibugs>	 Operations, Ops-Access-Requests: Requesting access to stat1004, stat1005, stat1006 for mneisler - https://phabricator.wikimedia.org/T184838#3911111 (RobH) Please note we're still awaiting your managers sign off on this task.  Once that is in, we should be able to process this.  Thanks!
[20:53:40] <logmsgbot>	 !log arlolra@tin Started deploy [parsoid/deploy@a95fede]: Update Parsoid config, again
[20:53:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:19] <logmsgbot>	 !log arlolra@tin Finished deploy [parsoid/deploy@a95fede]: Update Parsoid config, again (duration: 09m 39s)
[21:03:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:23] <icinga-wm>	 RECOVERY - puppet last run on graphite1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[21:22:04] <wikibugs>	 (CR) Cmjohnson: [C: 2] Removing mgmt dns entries for decom host erbium [dns] - https://gerrit.wikimedia.org/r/405012 (https://phabricator.wikimedia.org/T185226) (owner: Cmjohnson)
[21:28:52] <wikibugs>	 (PS3) Chad: Gerrit 2.14.6 [software/gerrit] - https://gerrit.wikimedia.org/r/395820 (https://phabricator.wikimedia.org/T156120)
[21:31:28] <wikibugs>	 Operations: hardware request for tin replacement - https://phabricator.wikimedia.org/T184481#3911209 (Dzahn)
[21:33:59] <wikibugs>	 Operations: hardware request for tin replacement - https://phabricator.wikimedia.org/T184481#3911213 (Dzahn)
[21:34:33] <wikibugs>	 Operations: hardware request for tin replacement - https://phabricator.wikimedia.org/T184481#3884434 (Dzahn) a:Dzahn>None
[21:34:47] <wikibugs>	 Operations, hardware-requests: hardware request for tin replacement - https://phabricator.wikimedia.org/T184481#3884434 (Dzahn)
[21:35:25] <wikibugs>	 (PS24) Paladox: gerrit: Ajust scap files (DO NOT MERGE) [software/gerrit] - https://gerrit.wikimedia.org/r/363738
[21:36:33] <wikibugs>	 (PS2) Tjones: Revert "Updates to enable short URLs for transliteration for crhwiki - beta" [puppet] - https://gerrit.wikimedia.org/r/405048 (https://phabricator.wikimedia.org/T23582)
[21:36:57] <wikibugs>	 (CR) jerkins-bot: [V: -1] Revert "Updates to enable short URLs for transliteration for crhwiki - beta" [puppet] - https://gerrit.wikimedia.org/r/405048 (https://phabricator.wikimedia.org/T23582) (owner: Tjones)
[21:40:01] <wikibugs>	 (CR) Paladox: [C: 1] Gerrit 2.14.6 [software/gerrit] - https://gerrit.wikimedia.org/r/395820 (https://phabricator.wikimedia.org/T156120) (owner: Chad)
[21:44:52] <wikibugs>	 (PS3) Tjones: Revert "Updates to enable short URLs for transliteration for crhwiki - beta" [puppet] - https://gerrit.wikimedia.org/r/405048 (https://phabricator.wikimedia.org/T23582)
[21:47:37] <wikibugs>	 (CR) Rush: rabbitmq: handling users and initial setup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/403202 (owner: Rush)
[21:48:14] <wikibugs>	 (PS4) Tjones: Revert "Updates to enable short URLs for transliteration for crhwiki - beta" [puppet] - https://gerrit.wikimedia.org/r/405048 (https://phabricator.wikimedia.org/T23582)
[21:50:26] <wikibugs>	 (CR) Andrew Bogott: [C: 1] "I'm convinced" [puppet] - https://gerrit.wikimedia.org/r/403202 (owner: Rush)
[22:04:47] <wikibugs>	 (PS1) Andrew Bogott: role::labs::mediawiki_vagrant:  Warn if not on Jessie [puppet] - https://gerrit.wikimedia.org/r/405203
[22:13:54] <wikibugs>	 (PS1) Dduvall: Add service-checker image used to test service images [docker-images/production-images] - https://gerrit.wikimedia.org/r/405205
[22:15:27] <wikibugs>	 (PS1) EBernhardson: Switch wiktionary sister search on enwiki to title only [mediawiki-config] - https://gerrit.wikimedia.org/r/405206 (https://phabricator.wikimedia.org/T185250)
[22:15:33] <wikibugs>	 (PS2) BryanDavis: role::labs::mediawiki_vagrant:  Warn if not on Jessie [puppet] - https://gerrit.wikimedia.org/r/405203 (https://phabricator.wikimedia.org/T180377) (owner: Andrew Bogott)
[22:16:04] <no_justification>	 Uh crap. I think I broke archiva. It's returning 502 bad gateway
[22:16:12] <wikibugs>	 (CR) BryanDavis: [C: 1] "Added references to the bug report so we remember to clean this up when I finally get around to fixing the problem." [puppet] - https://gerrit.wikimedia.org/r/405203 (https://phabricator.wikimedia.org/T180377) (owner: Andrew Bogott)
[22:16:40] <wikibugs>	 (CR) Paladox: "I found the error see https://phabricator.wikimedia.org/T180377#3911321 maybe easier to fix it in lxc." [puppet] - https://gerrit.wikimedia.org/r/405203 (https://phabricator.wikimedia.org/T180377) (owner: Andrew Bogott)
[22:16:43] <wikibugs>	 (CR) jerkins-bot: [V: -1] Switch wiktionary sister search on enwiki to title only [mediawiki-config] - https://gerrit.wikimedia.org/r/405206 (https://phabricator.wikimedia.org/T185250) (owner: EBernhardson)
[22:16:53] <wikibugs>	 (PS2) Dduvall: Add service-checker image used to test service images [docker-images/production-images] - https://gerrit.wikimedia.org/r/405205 (https://phabricator.wikimedia.org/T184220)
[22:17:06] <paladox>	 no_justification works for me
[22:17:11] <no_justification>	 Yeah nvm it's back
[22:20:18] <wikibugs>	 (PS4) Dduvall: Include scaffold for service-checker helm tests [deployment-charts] - https://gerrit.wikimedia.org/r/405016
[22:22:48] <wikibugs>	 (CR) Dduvall: Add service-checker image used to test service images (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/405205 (https://phabricator.wikimedia.org/T184220) (owner: Dduvall)
[22:23:24] <wikibugs>	 (Draft1) Paladox: lxc: Fix support for stretch [puppet] - https://gerrit.wikimedia.org/r/405208
[22:23:26] <wikibugs>	 (PS2) Paladox: lxc: Fix support for stretch [puppet] - https://gerrit.wikimedia.org/r/405208
[22:23:43] <wikibugs>	 (PS2) EBernhardson: Switch wiktionary sister search on enwiki to title only [mediawiki-config] - https://gerrit.wikimedia.org/r/405206 (https://phabricator.wikimedia.org/T185250)
[22:26:47] <wikibugs>	 (PS3) Paladox: lxc: Fix support for stretch [puppet] - https://gerrit.wikimedia.org/r/405208
[22:34:06] <wikibugs>	 (PS4) Paladox: lxc: Fix support for stretch [puppet] - https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377)
[22:34:40] <wikibugs>	 (CR) Paladox: "This allows puppet to run locally for me on a stretch instance." [puppet] - https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) (owner: Paladox)
[22:34:49] <wikibugs>	 Operations, Puppet: Trusty puppet 4 approach - https://phabricator.wikimedia.org/T182894#3911375 (herron) This week @chasemp discovered an issue where `puppet apply` breaks on trusty hosts with `Error: Evaluation Error: Error while evaluating a Function Call, uninitialized constant RGen::ECore::ELong`....
[22:34:52] <wikibugs>	 (CR) BryanDavis: lxc: Fix support for stretch (3 comments) [puppet] - https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) (owner: Paladox)
[22:35:23] <wikibugs>	 (CR) BryanDavis: "> This allows puppet to run locally for me on a stretch instance." [puppet] - https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) (owner: Paladox)
[22:36:50] <herron>	 !log added ruby-rgen-0.7.0-1 (backported package from jessie) to trusty-wikimedia apt repo (T182894)
[22:37:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:37:04] <stashbot>	 T182894: Trusty puppet 4 approach - https://phabricator.wikimedia.org/T182894
[22:37:25] <wikibugs>	 (PS5) Paladox: lxc: Fix support for stretch [puppet] - https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377)
[22:37:36] <wikibugs>	 (CR) Paladox: lxc: Fix support for stretch (3 comments) [puppet] - https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) (owner: Paladox)
[22:38:31] <wikibugs>	 (CR) Faidon Liambotis: lxc: Fix support for stretch (4 comments) [puppet] - https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) (owner: Paladox)
[22:39:28] <paravoid>	 ah, there's a new PS
[22:40:50] <wikibugs>	 (CR) Faidon Liambotis: [C: -1] lxc: Fix support for stretch (2 comments) [puppet] - https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) (owner: Paladox)
[22:50:23] <wikibugs>	 Operations, MediaWiki-Export-or-Import, Wikimedia-General-or-Unknown: Special:Import error: "Import failed: Could not open import file" - https://phabricator.wikimedia.org/T17000#3911411 (TTO)
[22:52:42] <wikibugs>	 (PS6) Paladox: lxc: Fix support for stretch [puppet] - https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377)
[22:54:27] <wikibugs>	 (CR) Paladox: lxc: Fix support for stretch (6 comments) [puppet] - https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) (owner: Paladox)
[22:56:09] <wikibugs>	 (CR) BryanDavis: lxc: Fix support for stretch (3 comments) [puppet] - https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) (owner: Paladox)
[23:01:17] <wikibugs>	 (PS7) Paladox: lxc: Fix support for stretch [puppet] - https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377)
[23:11:52] <urandom>	 !log bootstrapping restbase1015-b -- T184100
[23:12:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:05] <stashbot>	 T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100
[23:30:35] <wikibugs>	 Operations: hardware request for bast1001 replacement - https://phabricator.wikimedia.org/T184480#3911551 (Dzahn)
[23:31:04] <wikibugs>	 Operations, hardware-requests: hardware request for bast1001 replacement - https://phabricator.wikimedia.org/T184480#3884422 (Dzahn)
[23:33:06] <wikibugs>	 Operations, hardware-requests: hardware request for bast1001 replacement - https://phabricator.wikimedia.org/T184480#3911556 (Dzahn)
[23:33:20] <wikibugs>	 Operations, hardware-requests: hardware request for bast1001 replacement - https://phabricator.wikimedia.org/T184480#3884422 (Dzahn) a:Dzahn>None
[23:38:31] <wikibugs>	 Operations: replace tin (new hardware) - https://phabricator.wikimedia.org/T185275#3911588 (Dzahn)
[23:42:53] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1230 is OK: OK - load average: 10.00, 18.13, 23.44