[00:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180118T0000).
[00:00:05] Jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:54] \o
[00:01:42] (PS1) Ppchelko: Remove wmgDebugJobQueueEventBus config parameter. [mediawiki-config] - https://gerrit.wikimedia.org/r/404888
[00:04:10] (CR) Dzahn: [C: 2] Phabricator: Add translations library to phabricator profile [puppet] - https://gerrit.wikimedia.org/r/404887 (https://phabricator.wikimedia.org/T225) (owner: 20after4)
[00:04:47] (CR) Dzahn: [C: 2] "cool project to translate phab with translatewiki.net and cute low ticket number" [puppet] - https://gerrit.wikimedia.org/r/404887 (https://phabricator.wikimedia.org/T225) (owner: 20after4)
[00:05:34] Thanks mutante!
[00:05:46] twentyafterfour: are you able to swat?
[00:05:57] jdlrobson: I can
[00:06:03] thank you :)
[00:06:23] twentyafterfour: yw
[00:06:59] jdlrobson: in the order you put them on the deploy calendar?
[00:07:11] twentyafterfour: yes please
[00:12:34] waiting for jenkins...
[00:14:39] :)
[00:18:19] (PS1) Chad: WIP: Sync security patches for MW from deployment to nightlies server [puppet] - https://gerrit.wikimedia.org/r/404892
[00:20:08] (CR) Volans: [C: 1] "LGTM! Thanks for the fixes, and feel free to ignore my nitpicks comments inline."
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/403574 (https://phabricator.wikimedia.org/T183999) (owner: Thcipriani)
[00:20:40] Operations, ops-eqiad, netops: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#3908273 (RobH) p:Triage>Normal
[00:21:10] Operations, ops-eqiad, netops: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#3908288 (RobH) The port mappings for onsite use on the existing mr1-eqiad (and likely to match on new device unless @ayounsi advises otherwise): ge-0/0/0 Core: msw1-eqiad:ge-0/0/32 ge-0/0/1 Core: asw-a-eqi...
[00:23:04] jdlrobson: ok I've sync'd with mwdebug1002 can you test there?
[00:23:39] twentyafterfour: on it..
[00:24:41] twentyafterfour: looks good to me
[00:25:07] ok do we need to test the wmf.17 patch separately?
[00:25:33] if not I'll sync them both out together
[00:26:16] sync together should be fine
[00:26:35] (CR) 20after4: [C: 2] Use the correct Pashto Wikipedia wordmark on mobile site [mediawiki-config] - https://gerrit.wikimedia.org/r/404828 (https://phabricator.wikimedia.org/T184442) (owner: Jdlrobson)
[00:26:54] I'll have you test the 3rd patch then I'll sync all 3
[00:29:14] (CR) Dzahn: "http://puppet-compiler.wmflabs.org/9773/" [puppet] - https://gerrit.wikimedia.org/r/404892 (owner: Chad)
[00:29:30] (Merged) jenkins-bot: Use the correct Pashto Wikipedia wordmark on mobile site [mediawiki-config] - https://gerrit.wikimedia.org/r/404828 (https://phabricator.wikimedia.org/T184442) (owner: Jdlrobson)
[00:29:43] (CR) jenkins-bot: Use the correct Pashto Wikipedia wordmark on mobile site [mediawiki-config] - https://gerrit.wikimedia.org/r/404828 (https://phabricator.wikimedia.org/T184442) (owner: Jdlrobson)
[00:30:20] (PS2) Chad: WIP: Sync security patches for MW from deployment to nightlies server [puppet] - https://gerrit.wikimedia.org/r/404892
[00:31:00] jdlrobson: ok the pashto wordmark patch is on mwdebug1002
[00:31:07]
twentyafterfour: on it.
[00:32:02] twentyafterfour: gimme 5 mins to confirm with a designer :)
[00:32:44] ok
[00:33:49] twentyafterfour good to go! https://usercontent.irccloud-cdn.com/file/V7oVzShb/image.png
[00:33:57] ahh gif to png fail
[00:34:09] https://media1.giphy.com/media/xUOxf4Qm1KX6RuUuxa/giphy-downsized.gif
[00:35:32] !log twentyafterfour@tin Started scap: Evening SWAT
[00:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:27] thx twentyafterfour :)
[00:40:06] Operations, MediaWiki-JobQueue, Performance-Team, Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3908314 (Tgr) ``` tgr@deployment-mediawiki04:~$ telnet deployment-redis01.deployment-prep.eqiad.wmflabs 6379 Trying 10.68.16.177... telnet: Un...
[00:41:15] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 3 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Service[nagios-nrpe-server],Exec[ip addr add 2620:0:860:102:10:192:16:139/64 dev eth0],Exec[absent_ensure_members]
[00:42:55] (CR) Dzahn: [C: 1] "looks good. rsync on tin, cron on releases1001, none releases2001" [puppet] - https://gerrit.wikimedia.org/r/404892 (owner: Chad)
[00:45:02] jdlrobson: you're welcome
[00:45:22] twentyafterfour: scap is still running right?
[00:47:47] right
[00:48:08] it's on canaries now
[00:49:32] jdlrobson: now it's syncing proxies so it's almost there
[00:55:05] ok it's done syncing, just doing cdb rebuild now
[00:58:05] PROBLEM - configured eth on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:58:24] PROBLEM - Disk space on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:58:31] twentyafterfour: awesome..
seeing the logo etc now
[00:58:34] PROBLEM - Check systemd state on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:59:04] PROBLEM - dhclient process on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:59:14] PROBLEM - Check whether ferm is active by checking the default input chain on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:59:35] PROBLEM - Check size of conntrack table on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:59:44] PROBLEM - DPKG on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[01:00:01] !log twentyafterfour@tin Finished scap: Evening SWAT (duration: 24m 29s)
[01:00:04] twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180118T0100).
[01:00:04] No GERRIT patches in the queue for this window AFAICS.
[01:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:56] jdlrobson: sweet, that concludes this Evening SWAT.
[01:01:15] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[01:02:00] !log Evening SWAT completed.
Starting phabricator deployment of #phabricator-2018-07-17 [release/2017-01-17/1]
[01:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:44] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer
[01:02:45] RECOVERY - DPKG on pybal-test2001 is OK: All packages OK
[01:03:04] RECOVERY - dhclient process on pybal-test2001 is OK: PROCS OK: 0 processes with command name dhclient
[01:03:05] RECOVERY - configured eth on pybal-test2001 is OK: OK - interfaces up
[01:03:14] RECOVERY - Check whether ferm is active by checking the default input chain on pybal-test2001 is OK: OK ferm input default policy is set
[01:03:25] RECOVERY - Disk space on pybal-test2001 is OK: DISK OK
[01:03:34] RECOVERY - Check systemd state on pybal-test2001 is OK: OK - running: The system is fully operational
[01:03:35] RECOVERY - Check size of conntrack table on pybal-test2001 is OK: OK: nf_conntrack is 0 % full
[01:03:44] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[01:04:09] Phabricator will be offline for a short time.
[01:08:33] !log phabricator deployment finished without incident.
[01:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:15] RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:13:00] Operations, MediaWiki-JobQueue, Performance-Team, Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3908373 (Tgr) So this is partially my fault, sorry :/ I started redis with `sudo service redis-server start` but the redis service that is pro...
[01:26:04] Operations, MediaWiki-JobQueue, Performance-Team, Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3908393 (Tgr) p:High>Normal a:Tgr >>!
In T185055#3908373, @Tgr wrote: > nutcracker is still dead :( It just needed a restart, so...
[01:31:18] Operations, MediaWiki-JobQueue, Performance-Team, Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3908399 (Tgr) Also I have zero clue why `redis-cli -s /var/run/nutcracker/redis_eqiad.sock -a ` and `redis-cli -s /var/run/n...
[02:27:50] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.16) (duration: 07m 18s)
[02:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:44] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 863.40 seconds
[03:56:44] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 196.05 seconds
[04:07:04] PROBLEM - Disk space on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:07:15] PROBLEM - Check size of conntrack table on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:07:24] PROBLEM - Check systemd state on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:07:34] PROBLEM - DPKG on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:07:44] PROBLEM - dhclient process on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:07:45] PROBLEM - configured eth on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:07:54] PROBLEM - Check whether ferm is active by checking the default input chain on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:08:15] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:10:14] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer
[04:14:24] RECOVERY - Check systemd state on pybal-test2001 is OK: OK - running: The system is fully operational
[04:14:34] RECOVERY - DPKG on pybal-test2001 is OK: All packages OK
[04:14:44] RECOVERY - dhclient process on pybal-test2001 is OK: PROCS OK: 0 processes with command name dhclient
[04:14:54] RECOVERY - configured eth on pybal-test2001 is OK: OK - interfaces up
[04:14:54] RECOVERY - Check whether ferm is active by checking the default input chain on pybal-test2001 is OK: OK ferm input default policy is set
[04:15:04] RECOVERY - Disk space on pybal-test2001 is OK: DISK OK
[04:15:15] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[04:15:15] RECOVERY - Check size of conntrack table on pybal-test2001 is OK: OK: nf_conntrack is 0 % full
[04:18:15] RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 39 minutes ago with 0 failures
[06:03:55] (PS1) Groovier1: Adding config for WikimediaEvents module for logging behaviour data [mediawiki-config] - https://gerrit.wikimedia.org/r/404910
[06:07:54] (PS1) Rxy: Change autoconfirmed settings and Enable flood group at zhwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182)
[06:12:52] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/404912
[06:12:56] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/404912
[06:15:30] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/404912 (owner: Marostegui)
[06:16:57] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/404912 (owner: Marostegui)
[06:17:07] (CR) jenkins-bot:
Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/404912 (owner: Marostegui)
[06:18:42] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1099:3318 - T174569 (duration: 01m 13s)
[06:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:58] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:21:14] (PS1) Marostegui: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/404913 (https://phabricator.wikimedia.org/T174569)
[06:23:34] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/404913 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:24:59] (Merged) jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/404913 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:27:00] (CR) jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/404913 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:27:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1087 - T174569 (duration: 01m 12s)
[06:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:16] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:27:27] !log Deploy schema change on s8 db1087 (sanitarium master) with replication (this will generate lag on labs servers) - T174569
[06:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:18] RECOVERY - puppet last run on labtestnet2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:36:04] (PS2) Rxy: Change autoconfirmed settings and Enable flood group at zhwikibooks [mediawiki-config] -
https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182)
[06:52:26] (CR) Urbanecm: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182) (owner: Rxy)
[06:53:46] (CR) jerkins-bot: [V: -1] Change autoconfirmed settings and Enable flood group at zhwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182) (owner: Rxy)
[06:59:30] (PS3) Rxy: Change autoconfirmed settings and Enable flood group at zhwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182)
[07:01:46] (CR) Rxy: "> Uploaded patch set 3." [mediawiki-config] - https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182) (owner: Rxy)
[07:05:38] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182) (owner: Rxy)
[07:07:47] (CR) Urbanecm: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182) (owner: Rxy)
[07:10:28] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[07:40:19] RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:44:45] (PS3) Giuseppe Lavagetto: Add Python 3 support [software/conftool] - https://gerrit.wikimedia.org/r/387544 (owner: Volans)
[08:02:48] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[08:08:49] (PS1) Muehlenhoff: Also add samwalton9, samtar to absented group [puppet] - https://gerrit.wikimedia.org/r/404933
[08:09:36] (CR) Muehlenhoff: [C: 2] Also add samwalton9, samtar to absented group [puppet] - https://gerrit.wikimedia.org/r/404933 (owner: Muehlenhoff)
[08:16:49] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[08:17:49] (CR) Ema: [C: 2] Add unit test cases for Server [debs/pybal] - https://gerrit.wikimedia.org/r/404704 (owner: Mark Bergsma)
[08:18:48] (CR) Ema: [C: 2] Separate out coordinator.Server into its own module [debs/pybal] - https://gerrit.wikimedia.org/r/404713 (owner: Mark Bergsma)
[08:19:26] (CR) Ema: [C: 2] Expand test coverage of server.py [debs/pybal] - https://gerrit.wikimedia.org/r/404762 (owner: Mark Bergsma)
[08:27:28] (CR) Gergő Tisza: Adding config for WikimediaEvents module for logging behaviour data (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/404910 (owner: Groovier1)
[08:27:35] (CR) Gergő Tisza: [C: -1] Adding config for WikimediaEvents module for logging behaviour data [mediawiki-config] - https://gerrit.wikimedia.org/r/404910 (owner: Groovier1)
[08:30:20] !log reboot iron for kernel security update
[08:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:21] !log bootstrap cassandra-c on restbase1013
[08:37:34] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:17] (PS1) Marostegui: db-eqiad.php: Depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/404936 (https://phabricator.wikimedia.org/T162807)
[08:44:34] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/404936 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[08:46:07] (Merged) jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/404936 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[08:46:27] (CR) Ema: "A few comments, looks good in general." (4 comments) [debs/pybal] - https://gerrit.wikimedia.org/r/393097 (https://phabricator.wikimedia.org/T165764) (owner: Mark Bergsma)
[08:46:49] (CR) jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/404936 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[08:51:53] (CR) Filippo Giunchedi: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444) (owner: Herron)
[08:52:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1067 - T162807 (duration: 01m 13s)
[08:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:46] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[08:59:05] Operations, DBA, Goal, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3908702 (jcrespo)
[09:02:09] (PS3) Filippo Giunchedi: restbase: reprovision restbase201[012] [puppet] - https://gerrit.wikimedia.org/r/404652 (https://phabricator.wikimedia.org/T184100)
[09:03:11] (CR) Filippo Giunchedi: [C: 2] restbase: reprovision restbase201[012] [puppet] - https://gerrit.wikimedia.org/r/404652
(https://phabricator.wikimedia.org/T184100) (owner: Filippo Giunchedi)
[09:05:07] (PS1) Ema: cache_upload: use resp.reason in vtc test cases [puppet] - https://gerrit.wikimedia.org/r/404940 (https://phabricator.wikimedia.org/T180433)
[09:07:11] !log Stop replication in sync db1089 db1067 - T162807
[09:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:23] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[09:07:35] (CR) Ema: [C: 2] cache_upload: use resp.reason in vtc test cases [puppet] - https://gerrit.wikimedia.org/r/404940 (https://phabricator.wikimedia.org/T180433) (owner: Ema)
[09:09:43] (PS1) Lokal Profil: Drop the medlem user group and editallpages user right [mediawiki-config] - https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981)
[09:10:01] !log reboot alcyone pollux sca2004 poolcounter2002 serpens for PCID/INVPCID CPU feature enabling
[09:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:03] (CR) jerkins-bot: [V: -1] Drop the medlem user group and editallpages user right [mediawiki-config] - https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981) (owner: Lokal Profil)
[09:12:05] (PS2) Lokal Profil: Drop the medlem user group and editallpages user right [mediawiki-config] - https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981)
[09:14:54] (PS1) Marostegui: db-eqiad.php: Remove db2034 [mediawiki-config] - https://gerrit.wikimedia.org/r/404943 (https://phabricator.wikimedia.org/T184888)
[09:15:27] (PS2) Marostegui: db-codfw.php: Remove db2034 [mediawiki-config] - https://gerrit.wikimedia.org/r/404943 (https://phabricator.wikimedia.org/T184888)
[09:19:09] (PS1) Ema: cache_upload: upgrade cp3034 to varnish 5 [puppet] - https://gerrit.wikimedia.org/r/404944 (https://phabricator.wikimedia.org/T180433)
[09:20:35] !log reboot oresrdb2001 for PCID/INVPCID CPU feature enabling
[09:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:27] !log reboot druid1001 for kernel upgrades
[09:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:02] (CR) Ema: [C: 2] cache_upload: upgrade cp3034 to varnish 5 [puppet] - https://gerrit.wikimedia.org/r/404944 (https://phabricator.wikimedia.org/T180433) (owner: Ema)
[09:25:48] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp3034 is CRITICAL: connect to address 10.20.0.169 and port 3122: Connection refused
[09:26:00] that's me, the host is depooled ^
[09:26:51] !log reimage es2003 to stretch
[09:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:48] !log !log Stop replication in sync db1089 and db2048 (codfw master) - T162807
[09:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:59] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[09:28:55] PROBLEM - puppet last run on restbase2012 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 14 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[cassandra],Package[cassandra/metrics-collector],Package[restbase/deploy]
[09:31:47] Operations, Scap: scap sudo violation on first puppet run - https://phabricator.wikimedia.org/T185189#3908798 (fgiunchedi)
[09:32:25] PROBLEM - Restbase root url on restbase2012 is CRITICAL: connect to address 10.192.48.67 and port 7231: Connection refused
[09:32:25] PROBLEM - cassandra-a service on restbase2010 is CRITICAL: NRPE: Command check_cassandra-a-state not defined
[09:32:25] PROBLEM - cassandra-a CQL 10.192.32.152:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.152 and port 9042: Connection refused
[09:32:56] PROBLEM - puppet last run on restbase2011 is CRITICAL: CRITICAL: Puppet has 4 failures.
Last run 16 minutes ago with 4 failures. Failed resources (up to 3 shown): Package[restbase/deploy],Service[cassandra-a],Service[cassandra-b],Service[cassandra-c]
[09:34:06] PROBLEM - cassandra-b CQL 10.192.16.187:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.187 and port 9042: Connection refused
[09:34:06] PROBLEM - cassandra-a CQL 10.192.48.68:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.68 and port 9042: Connection refused
[09:34:06] PROBLEM - cassandra-a SSL 10.192.32.152:7001 on restbase2011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[09:34:46] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp3034 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.168 second response time
[09:35:29] (PS1) Filippo Giunchedi: scap: require sudo rules to be in place before deploy [puppet] - https://gerrit.wikimedia.org/r/404945 (https://phabricator.wikimedia.org/T185189)
[09:35:46] PROBLEM - cassandra-b SSL 10.192.16.187:7001 on restbase2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[09:35:46] PROBLEM - cassandra-a SSL 10.192.48.68:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[09:35:47] PROBLEM - cassandra-a service on restbase2011 is CRITICAL: NRPE: Command check_cassandra-a-state not defined
[09:36:15] (PS3) Marostegui: db-codfw.php: Remove db2034 from s1 [mediawiki-config] - https://gerrit.wikimedia.org/r/404943 (https://phabricator.wikimedia.org/T184888)
[09:37:35] PROBLEM - cassandra-b CQL 10.192.32.153:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.153 and port 9042: Connection refused
[09:37:35] PROBLEM - cassandra-b service on restbase2010 is CRITICAL: NRPE: Command check_cassandra-b-state not defined
[09:37:35] PROBLEM - cassandra-a service on restbase2012 is CRITICAL: NRPE: Command check_cassandra-a-state not defined
[09:38:34] !log reboot thorium (analytics
webserver) for security upgrade - This maintenance will cause temporary unavailability of the Analytics websites
[09:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:15] PROBLEM - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.188 and port 9042: Connection refused
[09:39:15] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused
[09:39:16] PROBLEM - cassandra-b SSL 10.192.32.153:7001 on restbase2011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[09:40:56] PROBLEM - cassandra-b SSL 10.192.48.69:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[09:40:56] PROBLEM - cassandra-c SSL 10.192.16.188:7001 on restbase2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[09:40:56] PROBLEM - cassandra-b service on restbase2011 is CRITICAL: NRPE: Command check_cassandra-b-state not defined
[09:40:58] (PS2) Volans: wmf-auto-reimage: fix host validation logic [puppet] - https://gerrit.wikimedia.org/r/404439 (https://phabricator.wikimedia.org/T182702)
[09:41:06] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - https://gerrit.wikimedia.org/r/404946
[09:41:36] mmm is this expired downtime godog --^ ?
[09:41:53] ugh, yeah thanks elukey
[09:42:14] ah super, it was scary :D
[09:42:36] PROBLEM - cassandra-c CQL 10.192.32.154:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.154 and port 9042: Connection refused
[09:42:37] PROBLEM - cassandra-c service on restbase2010 is CRITICAL: NRPE: Command check_cassandra-c-state not defined
[09:42:37] PROBLEM - cassandra-b service on restbase2012 is CRITICAL: NRPE: Command check_cassandra-b-state not defined
[09:43:13] !log cache_upload: repooled cp3034 running varnish 5
[09:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:02] (CR) Jcrespo: [C: 1] "this is ok, assuming no_raise is set on --no-validate" [puppet] - https://gerrit.wikimedia.org/r/404439 (https://phabricator.wikimedia.org/T182702) (owner: Volans)
[09:45:24] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - https://gerrit.wikimedia.org/r/404946 (owner: Marostegui)
[09:46:31] (CR) Volans: [C: 2] "yes, indeed. Thanks for the review."
[puppet] - https://gerrit.wikimedia.org/r/404439 (https://phabricator.wikimedia.org/T182702) (owner: Volans)
[09:48:19] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - https://gerrit.wikimedia.org/r/404946 (owner: Marostegui)
[09:48:30] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - https://gerrit.wikimedia.org/r/404946 (owner: Marostegui)
[09:49:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1067 - T162807 (duration: 01m 12s)
[09:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:09] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[09:53:26] RECOVERY - Restbase root url on restbase2012 is OK: HTTP OK: HTTP/1.1 200 - 15785 bytes in 0.084 second response time
[09:58:21] !log reboot etcd1006 for PCID, INVPCID feature enabling (INVPCID not supported on current hardware, but still enabling it cluster wide)
[09:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:53] Operations, Analytics-Kanban, Patch-For-Review: Puppet admin module should support adding system users to managed groups - https://phabricator.wikimedia.org/T174465#3562875 (MoritzMuehlenhoff) This is solely for T174110 or are we anticipating other use cases? The high level implementation idea seem...
[10:07:39] !log rebooting rdb1002/rdb1004/rdb1006/rdb1008 for kernel security update
[10:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:52] !log mobrovac@tin Started deploy [restbase/deploy@04e7cdb]: Use stable packge names, normalise cache-control headers, update top definition - T184199 T184833 T184541
[10:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:07] T184833: Inconsistent behavior when fetching redirected pages with Cache-Control header - https://phabricator.wikimedia.org/T184833
[10:08:07] T184199: Discontinue the Cassandra, Sqlite and Spec -ng packages - https://phabricator.wikimedia.org/T184199
[10:08:07] T184541: Update AQS pageview-top definition - https://phabricator.wikimedia.org/T184541
[10:10:20] !log mobrovac@tin Finished deploy [restbase/deploy@04e7cdb]: Use stable packge names, normalise cache-control headers, update top definition - T184199 T184833 T184541 (duration: 02m 29s)
[10:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:27] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures.
Failed resources (up to 3 shown): Exec[bump nf_conntrack hash table size]
[10:12:53] !log mobrovac@tin Started deploy [restbase/deploy@5c353f7]: Use stable packge names, normalise cache-control headers, update top definition, take #2 - T184199 T184833 T184541
[10:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:50] ]]
[10:17:11] Operations, User-Elukey: Sporadic logrotate issue for stretch mediawiki appservers - https://phabricator.wikimedia.org/T185195#3908938 (elukey) p:Triage>Normal
[10:17:29] Operations, User-Elukey: Sporadic logrotate issue for stretch mediawiki appservers - https://phabricator.wikimedia.org/T185195#3908950 (elukey)
[10:17:32] Operations, Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#3908951 (elukey)
[10:19:06] Operations, Analytics-Kanban, monitoring, netops, User-Elukey: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3908967 (elukey) @faidon whenever you have time do you mind to explain a bit what data is currently pushed to the netflow...
[10:20:09] PROBLEM - Check size of conntrack table on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:20:18] PROBLEM - Check systemd state on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:20:19] PROBLEM - dhclient process on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:20:19] PROBLEM - DPKG on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:20:28] PROBLEM - Disk space on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:20:38] PROBLEM - configured eth on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:20:39] PROBLEM - Check whether ferm is active by checking the default input chain on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:21:28] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer [10:22:09] RECOVERY - Check size of conntrack table on pybal-test2001 is OK: OK: nf_conntrack is 0 % full [10:22:18] RECOVERY - Check systemd state on pybal-test2001 is OK: OK - running: The system is fully operational [10:22:19] RECOVERY - dhclient process on pybal-test2001 is OK: PROCS OK: 0 processes with command name dhclient [10:22:19] RECOVERY - DPKG on pybal-test2001 is OK: All packages OK [10:22:29] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [10:22:29] RECOVERY - Disk space on pybal-test2001 is OK: DISK OK [10:22:38] RECOVERY - configured eth on pybal-test2001 is OK: OK - interfaces up [10:22:39] RECOVERY - Check whether ferm is active by checking the default input chain on pybal-test2001 is OK: OK ferm input default policy is set [10:25:11] !log mobrovac@tin Finished deploy [restbase/deploy@5c353f7]: Use stable packge names, normalise cache-control headers, update top definition, take #2 - T184199 T184833 T184541 (duration: 12m 18s) [10:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:25] T184833: Inconsistent behavior when fetching redirected pages with Cache-Control header - https://phabricator.wikimedia.org/T184833 [10:25:25] T184199: Discontinue the Cassandra, Sqlite and Spec -ng packages - https://phabricator.wikimedia.org/T184199 [10:25:25] T184541: Update AQS pageview-top definition - https://phabricator.wikimedia.org/T184541 [10:25:58] (03CR) 10Marostegui: [C: 032] db-codfw.php: Remove db2034 from s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404943 (https://phabricator.wikimedia.org/T184888) (owner: 10Marostegui) [10:27:33] (03Merged) 10jenkins-bot: db-codfw.php: Remove db2034 from s1 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/404943 (https://phabricator.wikimedia.org/T184888) (owner: 10Marostegui) [10:27:36] 10Operations, 10Analytics-Kanban, 10Patch-For-Review: Puppet admin module should support adding system users to managed groups - https://phabricator.wikimedia.org/T174465#3562875 (10akosiaris) Ι 'll echo Moritz on this one. It does look like adding system users to the admin module adds some complexity and do... [10:27:45] (03CR) 10jenkins-bot: db-codfw.php: Remove db2034 from s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404943 (https://phabricator.wikimedia.org/T184888) (owner: 10Marostegui) [10:29:27] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db2034 from s1 as it will be in x1 - T184888 (duration: 01m 12s) [10:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:40] T184888: Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master)) - https://phabricator.wikimedia.org/T184888 [10:30:50] !log reboot etherpad1001 for PCID, INVPCID feature enabling (INVPCID not supported on current hardware, but still enabling it cluster wide) [10:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:09] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:32:09] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:32:09] PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:32:09] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:32:09] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:32:28] PROBLEM - puppet last run on wtp1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:32:39] PROBLEM - puppet last run on aqs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:33:09] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:33:38] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:33:38] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:34:19] PROBLEM - puppet last run on db1097 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:34:19] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:34:19] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:34:38] PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:34:38] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:35:43] !log ladsgroup@terbium:/srv/mediawiki/php-1.31.0-wmf.17$ mwscript extensions/WikibaseQualityConstraints/maintenance/ImportConstraintStatements.php --wiki wikidatawiki (T184720) [10:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:56] T184720: Re-import constraints from statements - https://phabricator.wikimedia.org/T184720 [10:36:53] (03PS1) 10Elukey: Allow to set the JAVA_HOME env variable in hadoop/hive/oozie [puppet] - 10https://gerrit.wikimedia.org/r/404954 (https://phabricator.wikimedia.org/T166248) [10:38:24] 10Operations, 10Performance-Team (Radar): Add profiling for Varnish and VCL - https://phabricator.wikimedia.org/T175710#3909125 (10Krinkle) [10:40:21] 10Operations, 10Page Content Service, 10RESTBase, 10Reading-Infrastructure-Team-Backlog, and 3 others: Inconsistent behavior when fetching redirected pages with Cache-Control header - https://phabricator.wikimedia.org/T184833#3909134 (10mobrovac) 05Open>03Resolved a:03Pchelolo It seems @Pchelolo's no... 
[10:41:37] (03PS2) 10Elukey: Allow to set the JAVA_HOME env variable in hadoop/hive/oozie [puppet] - 10https://gerrit.wikimedia.org/r/404954 (https://phabricator.wikimedia.org/T166248) [10:42:28] RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:45:19] 10Operations, 10Performance-Team, 10Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3909156 (10Krinkle) [10:45:49] (03PS1) 10Marostegui: install_server: Allow db2034 reinstall as stretch [puppet] - 10https://gerrit.wikimedia.org/r/404955 (https://phabricator.wikimedia.org/T184888) [10:49:58] (03PS2) 10Marostegui: install_server: Allow db2034 reinstall as stretch [puppet] - 10https://gerrit.wikimedia.org/r/404955 (https://phabricator.wikimedia.org/T184888) [10:57:27] !log reboot actinium.wikimedia.org aluminium.wikimedia.org argon.eqiad.wmnet boron.eqiad.wmnet bromine.eqiad.wmnet darmstadtium.eqiad.wmnet dbmonitor1001.wikimedia.org dubnium.wikimedia.org dysprosium.wikimedia.org etcd1001.eqiad.wmnet etcd1004.eqiad.wmnet fermium.wikimedia.org hassium.eqiad.wmnet kubestagetcd1001.eqiad.wmnet logstash1007.eqiad.wmnet meitnerium.wikimedia.org mendelevium.eqiad.wmnet mwdebug1002.eqiad.wmnet m [10:57:27] x1001.wikimedia.org neon.eqiad.wmnet netmon1003.wikimedia.org planet1001.eqiad.wmnet poolcounter1001.eqiad.wmnet releases1001.eqiad.wmnet roentgenium.eqiad.wmnet rutherfordium.eqiad.wmnet sca1003.eqiad.wmnet ununpentium.wikimedia.org for PCID, INVPCID feature enabling (INVPCID not supported on current hardware, but still enabling it cluster wide) [10:57:31] heads up ^ [10:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:08] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [10:58:38] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 3 
seconds ago with 0 failures [10:59:18] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:59:19] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [10:59:38] RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:59:38] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [11:01:09] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:02:08] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:02:08] RECOVERY - puppet last run on db1073 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:02:09] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:02:09] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:02:28] RECOVERY - puppet last run on wtp1026 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:02:39] RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:02:56] 10Operations, 10User-Elukey: Sporadic logrotate issue for stretch mediawiki appservers - https://phabricator.wikimedia.org/T185195#3909201 (10Volans) FYI https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=881725 [11:03:38] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:04:18] RECOVERY - puppet last run on db1097 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:05:18] PROBLEM - etc request latencies on argon is CRITICAL: 
CRITICAL - etcd_request_latencies is 58689 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:06:19] RECOVERY - etc request latencies on argon is OK: OK - etcd_request_latencies is 1964 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:06:27] !log disabled puppet on tegmen to test impact on puppetdb - T170740 [11:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:38] T170740: PuppetDB misbehaving on 2017-07-15 - https://phabricator.wikimedia.org/T170740 [11:22:07] (03CR) 10Mobrovac: [C: 031] "@Ppchelko, could you SWAT this today? Even though we could get it in together with the next migration." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404888 (owner: 10Ppchelko) [11:34:27] !log reboot logstash1008 etcd1002 kubestagetcd1002.eqiad.wmnet for PCID, INVPCID feature enabling (INVPCID not supported on current hardware, but still enabling it cluster wide) [11:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:06] (03PS1) 10Arturo Borrero Gonzalez: apt: apt-upgrades: dont fail if new packages are being installed [puppet] - 10https://gerrit.wikimedia.org/r/404963 (https://phabricator.wikimedia.org/T178717) [11:38:50] (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: apt-upgrades: dont fail if new packages are being installed [puppet] - 10https://gerrit.wikimedia.org/r/404963 (https://phabricator.wikimedia.org/T178717) (owner: 10Arturo Borrero Gonzalez) [11:46:28] hi [11:46:38] https://fr.wikisource.org/wiki/Le_Roi_d%E2%80%99Yvetot [11:46:58] any idea what's going on here? 
^ [11:47:12] [WmCIHQpAMFAAAES6KqkAAACW] 2018-01-18 11:42:24: Erreur fatale de type « InvalidArgumentException » [11:47:28] it's the first time I see such a message [11:50:46] https://phabricator.wikimedia.org/T185204 [11:51:55] <_joe_> yannf: taking a look [11:52:12] php-1.31.0-wmf.17/includes/filebackend/FileBackendGroup.php:183 [11:52:59] <_joe_> yeah looks like a MediaWiki failure [11:53:29] 10Operations, 10monitoring, 10Patch-For-Review, 10User-Elukey: Configure puppetdb to export metrics via Prometheus JMX Agent - https://phabricator.wikimedia.org/T184796#3909310 (10elukey) 05Open>03Resolved Closing task since https://grafana.wikimedia.org/dashboard/db/puppetdb is almost a replica of the... [11:54:07] 10Operations, 10Puppet, 10Patch-For-Review: PuppetDB misbehaving on 2017-07-15 - https://phabricator.wikimedia.org/T170740#3909312 (10elukey) The puppetdb grafana dashboard (and its related monitoring config for nitrogen/nihal) were added in https://phabricator.wikimedia.org/T184796 [11:59:24] (03PS1) 10Muehlenhoff: Depool poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404967 [12:01:56] (03CR) 10jerkins-bot: [V: 04-1] Depool poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404967 (owner: 10Muehlenhoff) [12:19:30] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3909349 (10Deskana) I don't want deployment of TemplateStyles to be a moving target—as it has been for many months—so I'm going to stick... [12:21:01] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3909354 (10jcrespo) [12:29:19] PROBLEM - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [12:30:42] (03PS5) 10Jcrespo: compare.py: Implement progress reporting, more than 2 servers comp. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/404647 [12:36:20] !log reboot chlorine.eqiad.wmnet etcd1003.eqiad.wmnet etcd1005.eqiad.wmnet fermium.wikimedia.org install1002.wikimedia.org krypton.eqiad.wmnet kubestagetcd1003.eqiad.wmnet logstash1009.eqiad.wmnet mwdebug1001.eqiad.wmnet sca1004.eqiad.wmnet for PCID, INVPCID feature enabling (INVPCID not supported on current hardware, but still enabling it cluster wide) [12:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:58] PROBLEM - Check whether ferm is active by checking the default input chain on install1002 is CRITICAL: Return code of 255 is out of bounds [12:39:08] PROBLEM - Squid on install1002 is CRITICAL: connect to address 208.80.154.22 and port 8080: Connection refused [12:39:28] PROBLEM - Check systemd state on install1002 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. [12:39:59] RECOVERY - Check whether ferm is active by checking the default input chain on install1002 is OK: OK ferm input default policy is set [12:40:08] RECOVERY - Squid on install1002 is OK: TCP OK - 0.000 second response time on 208.80.154.22 port 8080 [12:40:40] !log set piwik in readonly mode and stopped mysql on bohrium (prep step for reboot) [12:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:28] RECOVERY - Check systemd state on install1002 is OK: OK - running: The system is fully operational [12:43:03] !log disable puppet across the fleet for nitrogen (puppetdb) reboot [12:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:15] * akosiaris <3 cumin [12:43:52] yw :) [12:43:58] !log bohrium rebooted for kernel upgrades [12:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:32] !log reboot seaborgium for PCID, INVPCID feature enabling (INVPCID not supported on current 
hardware, but still enabling it cluster wide) [12:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:41] (03CR) 10Marostegui: [C: 032] install_server: Allow db2034 reinstall as stretch [puppet] - 10https://gerrit.wikimedia.org/r/404955 (https://phabricator.wikimedia.org/T184888) (owner: 10Marostegui) [12:50:46] (03PS3) 10Marostegui: install_server: Allow db2034 reinstall as stretch [puppet] - 10https://gerrit.wikimedia.org/r/404955 (https://phabricator.wikimedia.org/T184888) [12:56:37] !log reboot nitrogen for PCID, INVPCID feature enabling (INVPCID not supported on current hardware, but still enabling it cluster wide) [12:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:58] !log enable puppet across the fleet after nitrogen (puppetdb) reboot [12:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:14] ok I think we are done with KVM vms [12:58:39] PROBLEM - MariaDB Slave IO: s3 on db2050 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2036.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2036.codfw.wmnet (111 Connection refused) [12:58:48] PROBLEM - MariaDB Slave IO: s3 on db2036 is CRITICAL: CRITICAL slave_io_state could not connect [12:58:48] PROBLEM - MariaDB Slave IO: s3 on dbstore2001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2036.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2036.codfw.wmnet (111 Connection refused) [12:58:48] PROBLEM - MariaDB Slave IO: s3 on db2057 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2036.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2036.codfw.wmnet 
(111 Connection refused) [12:59:06] marostegui: jynus: ^ known ? [12:59:09] PROBLEM - MariaDB Slave SQL: s3 on db2036 is CRITICAL: CRITICAL slave_sql_state could not connect [12:59:09] PROBLEM - MariaDB Slave IO: s3 on db2043 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2036.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2036.codfw.wmnet (111 Connection refused) [12:59:18] PROBLEM - MariaDB Slave IO: s3 on db2018 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2036.codfw.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db2036.codfw.wmnet (104 Connection reset by peer) [12:59:19] PROBLEM - MariaDB Slave IO: s3 on db2074 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2036.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2036.codfw.wmnet (111 Connection refused) [13:00:13] connection refused even on the unix socket ? [13:00:31] power mysqld is in the D state [13:01:00] Checking [13:01:50] akosiaris: webperf1001 and poolcounter2001 are KVM instances and still need a reboot [13:01:50] mysql crashed [13:01:58] it is doing recovery [13:02:10] marostegui: ok [13:02:28] moritzm: ok the first one is easy, the second wants a minor depooling, will do [13:02:40] thanks! [13:03:09] we've usually just rebooted the codfw poolcounters without depooling [13:03:15] !log reboot webperf1001 for PCID, INVPCID feature enabling (INVPCID not supported on current hardware, but still enabling it cluster wide) [13:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:28] did s3 master went down? [13:03:44] jynus: db2036 ? 
yes [13:03:47] mysql crash [13:03:48] yes [13:03:54] not sure we should put it up [13:04:10] I think it has no gtid, and we should promote other master instead [13:04:20] jynus: can it be related to the compare.py running? [13:04:25] moritzm: true. I 'll do so now [13:04:25] it could be [13:04:36] it wasn't running now [13:04:52] but activity == corruption/crash [13:05:16] !log reboot poolcounter2001 for PCID/INVPCID CPU feature enabling [13:05:19] despire execution thers stopped at 2018-01-18T12:59:27.299959 [13:05:23] akosiaris: I'll take care of poolcounter1002 (the only remaining baremetal poolcounter), but CI throws a weird error on https://gerrit.wikimedia.org/r/#/c/404967/ [13:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:28] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag could not connect [13:05:45] marostegui: let's stop all replicas there [13:05:48] https://phabricator.wikimedia.org/P6611 [13:06:22] that is my query [13:06:22] I changed my.cnf to make it start with replication stopped as it was starting [13:06:38] PROBLEM - MariaDB Slave Lag: s3 on db2074 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 616.94 seconds [13:06:58] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 638.40 seconds [13:07:18] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 654.88 seconds [13:07:19] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 661.23 seconds [13:07:19] PROBLEM - MariaDB Slave Lag: s3 on db2018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 662.47 seconds [13:07:42] but look at the query time [13:08:05] ah, so yes, it could be related [13:08:38] or more likely, it could be a trigger [13:08:49] All the slaves are stopped in the same position, so that is good [13:09:00] mysql is broken, it won't start [13:09:03] we have to promote a new host 
[13:09:05] db2043? [13:09:21] did you stop them manually or should I? [13:09:30] stop them yeah [13:09:32] pfff, don't know [13:09:34] I will prepare the patches [13:09:47] we can promote the old master [13:09:49] 35 didn't have partitioning [13:10:15] so no alter was done there [13:10:29] on the other side, maybe an older host has problems when many objects are queried [13:10:49] yeah, could be [13:10:50] with 35 I meant 36 [13:10:56] let's promote the old master for now? [13:11:06] PROBLEM - mysqld processes on db2036 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [13:11:07] all are on 10.0.33 [13:11:23] going to silence all the codfw hosts [13:11:37] that is ok to me, although we will have to change it again [13:11:38] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 3 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Service[nagios-nrpe-server],Exec[ip addr add 2620:0:860:102:10:192:16:139/64 dev eth0],Exec[absent_ensure_members] [13:11:50] I almost prefer another host [13:12:03] so we reconstruct db2035 from db2018 [13:12:09] *db2036, again [13:12:24] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9778/" [puppet] - 10https://gerrit.wikimedia.org/r/404954 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey) [13:12:27] sure [13:12:30] (03PS3) 10Elukey: Allow to set the JAVA_HOME env variable in hadoop/hive/oozie [puppet] - 10https://gerrit.wikimedia.org/r/404954 (https://phabricator.wikimedia.org/T166248) [13:12:32] your first proposal was ok db2043 [13:12:53] let's go for it [13:13:00] you do the change master and I prepare the patches? 
[13:13:04] let's double check with db1075 position [13:13:09] I will do that, then [13:13:14] ok, I will prepare the patches [13:13:21] you prepare puppet and mediawiki [13:13:34] yep [13:13:42] the slaves stopped at: 1000278638 [13:13:43] all of them [13:13:47] but double check it [13:14:31] yes, but we need a real master coords [13:14:48] for db2043 [13:15:38] yep [13:16:12] (03PS2) 10Filippo Giunchedi: scap: require sudo rules to be in place before deploy [puppet] - 10https://gerrit.wikimedia.org/r/404945 (https://phabricator.wikimedia.org/T185189) [13:16:23] let me know when I can restart db2043 for statement based change [13:18:04] (03PS1) 10Marostegui: mariadb: Promote db2043 to be s3 master [puppet] - 10https://gerrit.wikimedia.org/r/404974 [13:19:01] (03CR) 10Giuseppe Lavagetto: [C: 031] scap: require sudo rules to be in place before deploy [puppet] - 10https://gerrit.wikimedia.org/r/404945 (https://phabricator.wikimedia.org/T185189) (owner: 10Filippo Giunchedi) [13:19:24] !log changing topology of codfw s3 databases [13:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:24] jynus: we have to restart db2043, let me know when I can proceed [13:20:33] after the topology change [13:22:28] PROBLEM - Check Varnish expiry mailbox lag on cp3037 is CRITICAL: CRITICAL: expiry mailbox lag is 2062222 [13:22:41] (03PS1) 10Marostegui: db-codfw.php: Promote db2043 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404976 [13:22:43] I am changing to MASTER_LOG_FILE='db2043-bin.003722', MASTER_LOG_POS=379002384 [13:22:50] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/9779/" [puppet] - 10https://gerrit.wikimedia.org/r/404974 (owner: 10Marostegui) [13:22:53] all other replicas [13:23:21] marostegui: ok? 
[13:23:36] checking that [13:23:37] one sec [13:23:56] 379002384 is correct and so is db2043-bin.003722 [13:24:00] go ahead [13:24:03] thanks [13:24:21] (03CR) 10Filippo Giunchedi: [C: 032] "noop as expected https://puppet-compiler.wmflabs.org/compiler03/9780/" [puppet] - 10https://gerrit.wikimedia.org/r/404945 (https://phabricator.wikimedia.org/T185189) (owner: 10Filippo Giunchedi) [13:24:30] RECOVERY - MariaDB Slave IO: s3 on db2018 is OK: OK slave_io_state Slave_IO_Running: Yes [13:24:58] RECOVERY - MariaDB Slave IO: s3 on db2050 is OK: OK slave_io_state Slave_IO_Running: Yes [13:24:58] RECOVERY - MariaDB Slave IO: s3 on db2057 is OK: OK slave_io_state Slave_IO_Running: Yes [13:25:19] PROBLEM - Check size of conntrack table on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:25:19] PROBLEM - Check systemd state on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:25:28] PROBLEM - Disk space on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:25:29] (03PS3) 10Filippo Giunchedi: scap: require sudo rules to be in place before deploy [puppet] - 10https://gerrit.wikimedia.org/r/404945 (https://phabricator.wikimedia.org/T185189) [13:25:38] RECOVERY - MariaDB Slave IO: s3 on db2074 is OK: OK slave_io_state Slave_IO_Running: Yes [13:25:38] PROBLEM - configured eth on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:25:48] PROBLEM - Check whether ferm is active by checking the default input chain on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:25:58] RECOVERY - MariaDB Slave IO: s3 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:26:09] PROBLEM - dhclient process on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:26:19] PROBLEM - DPKG on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[13:27:01] the topology change of lower level is done, I have yet to promote db2043 as master [13:27:14] https://gerrit.wikimedia.org/r/#/c/404974/1 [13:27:18] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer [13:27:21] let's restart db2043 with statement [13:27:28] puppet is stopped there [13:27:39] let's merge, stop, run puppet and start mysql [13:28:14] let me gather all pos first [13:28:39] cool [13:29:20] RECOVERY - dhclient process on pybal-test2001 is OK: PROCS OK: 0 processes with command name dhclient [13:29:20] RECOVERY - Check size of conntrack table on pybal-test2001 is OK: OK: nf_conntrack is 0 % full [13:29:21] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [13:29:21] RECOVERY - Check systemd state on pybal-test2001 is OK: OK - running: The system is fully operational [13:29:21] RECOVERY - DPKG on pybal-test2001 is OK: All packages OK [13:29:30] RECOVERY - Disk space on pybal-test2001 is OK: DISK OK [13:29:40] RECOVERY - configured eth on pybal-test2001 is OK: OK - interfaces up [13:29:51] RECOVERY - Check whether ferm is active by checking the default input chain on pybal-test2001 is OK: OK ferm input default policy is set [13:31:57] (03CR) 10Alexandros Kosiaris: [C: 031] "No, I don't think it's you. 
Look at the production catalog as well (http://puppet-compiler.wmflabs.org/9768/ganeti1001.eqiad.wmnet/prod.ga" [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn) [13:34:16] (03PS7) 10Alexandros Kosiaris: ganeti: create profiles, split monitoring/firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn) [13:34:45] (03CR) 10Alexandros Kosiaris: [C: 032] "I 'll merge to confirm my suspicion and I 'll be ready for a rollback" [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn) [13:34:53] sorry it is taking me some time, as it is not user facing, I am trying to be thorough [13:35:14] You have the position from where db2043 needs to replicate on db1075? [13:35:47] db1075-bin.002424 [13:35:47] 432232308 [13:35:49] that is what I got [13:36:59] I am on it [13:39:06] last commit on binlog is a set of INSERTs INTO `srwiki`.`pagelinks` [13:40:34] 10Operations, 10ops-codfw: Degraded RAID on restbase2007 - https://phabricator.wikimedia.org/T185214#3909578 (10ops-monitoring-bot) [13:40:47] I get 432231034 [13:41:20] (03PS3) 10Awight: Split out retrieving globals and use a more machine-readable format [dumps] - 10https://gerrit.wikimedia.org/r/348002 [13:41:29] its gtid is 171966669-171966669-1746907396 [13:41:40] RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:41:48] 432231034 for db1075 you mean? 
[13:41:58] yes [13:42:20] that is around 1K before yours [13:42:29] yep [13:42:29] (03PS1) 10Alexandros Kosiaris: ganeti: Remove KSM diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/404977 [13:42:32] (03PS1) 10Alexandros Kosiaris: ganeti: Remove long absented resource [puppet] - 10https://gerrit.wikimedia.org/r/404978 [13:42:38] let me double check [13:42:51] (03CR) 10Alexandros Kosiaris: "A noop on ganeti1001, no firewall rule changes, the bug seems to rest with the compiler" [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn) [13:44:50] 10Operations, 10ops-codfw: Degraded RAID on restbase2007 - https://phabricator.wikimedia.org/T185214#3909585 (10fgiunchedi) 05Open>03Invalid That's me turning HBA mode on as part of {T184100} and forgetting the downtime [13:44:56] marostegui: update on dawiki.page; insert on srwiki.pagelinks (CUT) INSERT on zh_min_nanwiki.modulo_deps [13:45:07] yeah, I was seeing that right ow [13:45:08] now [13:49:14] yeah [13:49:19] I think it is 432231034 [13:49:34] !log upgrade mw* servers in eqiad running 3.18.5+dfsg-1+wmf3 (recent installations) to 3.18.5+dfsg-1+wmf4 [13:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:51] (03PS1) 10Ema: cache_upload: upgrade cp3037 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/404979 (https://phabricator.wikimedia.org/T180433) [13:50:26] hm, much of those at wmlog: Cannot access property on non-object in /srv/mediawiki/php-1.31.0-wmf.17/extensions/GlobalBlocking/includes/api/ApiGlobalBlock.php [13:51:10] (03PS1) 10Gehel: mediawiki: remove unused logging configuration of mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/404980 (https://phabricator.wikimedia.org/T182304) [13:51:33] (03CR) 10Ema: [C: 032] cache_upload: upgrade cp3037 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/404979 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [13:51:52] jynus: that UPDATE happened on the slaves then? 
[13:52:16] I mean, is the row consistent? [13:52:24] I am checking [13:52:43] the insert on srwiki.pagelinks happened [13:52:55] I am now checking that the latest insert didn't happen [13:53:44] 10Operations, 10Puppet, 10puppet-compiler: Puppet compiler failure to lookup some keys - https://phabricator.wikimedia.org/T185215#3909612 (10akosiaris) p:05Triage>03High Triaging as high as this might bite us and cause issues. We should investigate more and act accordingly [13:53:49] (03CR) 10Alexandros Kosiaris: "Tracked in https://phabricator.wikimedia.org/T185215" [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn) [13:54:05] !log cache_upload: upgrade cp3037 to varnish 5 [13:54:09] (03PS2) 10Alexandros Kosiaris: ganeti: Remove KSM diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/404977 [13:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:54] marostegui: the dawiki update did not happen [13:55:26] that is good then [13:55:51] so I have checked it is between 432231034 and 432232308 [13:56:59] (03CR) 10Muehlenhoff: [C: 031] ganeti: Remove KSM diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/404977 (owner: 10Alexandros Kosiaris) [13:57:09] yeah, if nothing after 432232533 happened, then we are safe [13:57:37] but check the new master binary log [13:57:44] the last insert is not there [13:57:47] (03CR) 10Alexandros Kosiaris: [C: 032] ganeti: Remove KSM diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/404977 (owner: 10Alexandros Kosiaris) [13:57:53] db2043 you mean? [13:57:59] yes [13:59:42] yeah, the dawiki update isn't there [13:59:52] but I think it is already inserted, maybe? so the last commit has not been logged? [14:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 8 patches). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180118T1400). [14:00:05] rxy: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:13] I can SWAT today [14:00:21] rxy: around for SWAT? [14:00:28] jynus: I am checking the insert on srwiki [14:00:30] zeljkof: hi, Yes [14:00:30] (03CR) 10Gehel: "Puppet compiler agree, this is a noop: https://puppet-compiler.wmflabs.org/compiler02/9781/" [puppet] - 10https://gerrit.wikimedia.org/r/404980 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [14:00:39] or, it has been delated later [14:00:53] rxy: I'll let you know when the commit is at mwdebug1002, it should be there in a few minutes [14:01:06] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182) (owner: 10Rxy) [14:01:38] I don't see: [14:01:40] use `srwiki`/*!*/; [14:01:40] SET TIMESTAMP=1516280175/*!*/; [14:01:40] INSERT /* LinksUpdate::incrTableUpdate [14:01:43] on the new master binlog [14:02:22] I do [14:02:30] RECOVERY - Check Varnish expiry mailbox lag on cp3037 is OK: OK: expiry mailbox lag is 0 [14:02:36] (03Merged) 10jenkins-bot: Change autoconfirmed settings and Enable flood group at zhwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182) (owner: 10Rxy) [14:02:52] (03CR) 10jenkins-bot: Change autoconfirmed settings and Enable flood group at zhwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404911 (https://phabricator.wikimedia.org/T185182) (owner: 10Rxy) [14:02:57] On: db2043-bin.003722 ? 
[14:03:19] see etherpad [14:03:23] checking [14:03:28] maybe we are talking about something different [14:03:50] yeah the LinksUpdate::incrTableUpdate gets written differently [14:03:51] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp3037 is CRITICAL: connect to address 10.20.0.172 and port 80: Connection refused [14:04:03] that's me ^ [14:04:42] in row format [14:04:50] (03CR) 10Gehel: "I'm wondering if we might have an unusual use case on WMCS which this might break... Worst case, we should just go to the class default an" [puppet] - 10https://gerrit.wikimedia.org/r/404980 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [14:04:51] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp3037 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.168 second response time [14:05:42] maybe that happened- it was a null REPLACE so it didn't get logged on the new master binlog [14:06:01] rxy: 404911 is at mwdebug1002, please test and let me know if I can deploy [14:06:21] let me know if you don't know how to test there [14:06:28] there is documentation about it [14:06:54] jynus: yeah, could be [14:06:56] you mention linksupdate [14:07:10] yes, for srwiki [14:07:11] I am mentioning pagelinks [14:07:21] zeljkof: https://zh.wikibooks.org/wiki/Special:群组权限?uselang=en with "X-Wikimedia-Debug: mwdebug1002.eqiad.wmnet" ok, It's work for me [14:07:29] yeah, I was talking about: use `srwiki`/*!*/; [14:07:29] SET TIMESTAMP=1516280175/*!*/; [14:07:29] INSERT /* LinksUpdate::incrTableUpdate */ IGNORE INTO `templatelinks` [14:07:46] rxy: ok, deploying [14:08:41] see my pointer [14:08:51] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 46.32, 35.35, 31.91 [14:09:03] templatelinks, which position? 
[14:09:32] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:404911|Change autoconfirmed settings and Enable flood group at zhwikibooks (T185182)]] (duration: 01m 13s) [14:09:33] 432233002 [14:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:43] T185182: Request to change autoconfirmed settings, allow autoconfirmed user to suppress redirects and allow sysop to grant and remove flood flags on zh.wikibooks - https://phabricator.wikimedia.org/T185182 [14:09:53] rxy: it's deployed, please check and thanks for deploying with #releng :) [14:10:12] yes, that is later [14:10:21] so I think we do not disagree on that [14:10:26] !log cache_upload: repool cp3037 (varnish 5) [14:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:55] !log EU SWAT finished [14:11:00] Cool, the srwiki.pagelinks I do see on db2043 binlog yeah [14:11:04] so we are agreeing [14:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:11] yeah, I think there is an extra query [14:12:21] that has been executed, but it is not on the binary log [14:12:50] (03PS1) 10Awight: Use a global mysql user if configured [dumps] - 10https://gerrit.wikimedia.org/r/404986 [14:12:51] in any case, we can check that single row after syncing them [14:12:52] (03PS1) 10Awight: Tolerate empty tables [dumps] - 10https://gerrit.wikimedia.org/r/404987 [14:13:06] or we can reboot the server and it should appear [14:13:06] zeljkof: thanks. I verified configuration in mw1329.eqiad.wmnet. It's work. 
[14:13:16] let me change the server and we can reboot it [14:13:23] did you merge the patches [14:13:45] Nope, waiting for you [14:13:54] Let's stop mysql, merge, run puppet and start [14:14:05] going to merge now, puppet is stopped on db2043 [14:14:36] (03CR) 10Marostegui: [C: 032] mariadb: Promote db2043 to be s3 master [puppet] - 10https://gerrit.wikimedia.org/r/404974 (owner: 10Marostegui) [14:14:38] (03PS4) 10Awight: Split out retrieving globals and use a more machine-readable format [dumps] - 10https://gerrit.wikimedia.org/r/348002 (https://phabricator.wikimedia.org/T185116) [14:14:40] (03PS2) 10Awight: Use a global mysql user if configured [dumps] - 10https://gerrit.wikimedia.org/r/404986 (https://phabricator.wikimedia.org/T185116) [14:14:41] (03PS2) 10Awight: Tolerate empty tables [dumps] - 10https://gerrit.wikimedia.org/r/404987 (https://phabricator.wikimedia.org/T185116) [14:14:44] (03PS2) 10Marostegui: mariadb: Promote db2043 to be s3 master [puppet] - 10https://gerrit.wikimedia.org/r/404974 [14:16:06] jynus: all merged, puppet remains stopped till you stop mysql [14:16:24] ok, then stopping it, is the alert down? 
[14:16:26] (03PS1) 10Ema: cache_upload: upgrade cp3035 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/404990 (https://phabricator.wikimedia.org/T180433) [14:16:48] yep, I have downtimed all codfw [14:17:08] !log stopping mysql on db2043 [14:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:28] Biplab:Are you here [14:17:49] Jayprakash12345: Yes [14:17:51] PROBLEM - DPKG on mw1347 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:18:01] PROBLEM - DPKG on mw1346 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:18:04] down [14:18:14] ok, enabling puppet and running it [14:18:31] PROBLEM - HHVM rendering on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:18:50] RECOVERY - DPKG on mw1347 is OK: All packages OK [14:18:51] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 47.39, 37.74, 31.61 [14:19:01] RECOVERY - DPKG on mw1346 is OK: All packages OK [14:19:11] we can start it again [14:19:13] puppet ran [14:19:21] RECOVERY - HHVM rendering on mw1347 is OK: HTTP OK: HTTP/1.1 200 OK - 75320 bytes in 0.330 second response time [14:19:31] !log starting mysql on db2043 [14:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:58] (03CR) 10Alexandros Kosiaris: [C: 032] network: reword slice_network_constants' errors [puppet] - 10https://gerrit.wikimedia.org/r/403699 (owner: 10Faidon Liambotis) [14:20:07] running puppet to get pt-heartbeat up [14:20:07] (03PS5) 10Alexandros Kosiaris: network: reword slice_network_constants' errors [puppet] - 10https://gerrit.wikimedia.org/r/403699 (owner: 10Faidon Liambotis) [14:20:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] network: reword slice_network_constants' errors [puppet] - 10https://gerrit.wikimedia.org/r/403699 (owner: 10Faidon Liambotis) [14:20:29] hhvm dpkg alerts is me [14:20:39] if you are ok, I will start replication again [14:20:41] PROBLEM - puppet last run on 
mw1343 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm] [14:20:44] Lydia_WMDE: https://phabricator.wikimedia.org/T175491 [14:20:47] yeah, go ahead [14:20:59] _joe_: there's some mw servers with high load, what's needed before restarting hhvm? [14:21:07] (03PS2) 10Alexandros Kosiaris: base: wrap lines in check_puppetrun to < 110 [puppet] - 10https://gerrit.wikimedia.org/r/403698 (owner: 10Faidon Liambotis) [14:21:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] base: wrap lines in check_puppetrun to < 110 [puppet] - 10https://gerrit.wikimedia.org/r/403698 (owner: 10Faidon Liambotis) [14:21:12] jynus: going to deploy mediawiki-config [14:21:17] <_joe_> godog: uh, which ones? [14:21:31] <_joe_> godog: I'd take a look at the output of hhvm-dump-debug [14:21:50] sorry, wrong channel. [14:21:57] _joe_: 2 critical and 4 warning in icinga atm [14:22:12] (03CR) 10Marostegui: [C: 032] db-codfw.php: Promote db2043 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404976 (owner: 10Marostegui) [14:22:13] Please visit https://wikitech.wikimedia.org/wiki/Deployments#Week_of_January_15th [14:22:14] (03PS2) 10Ema: cache_upload: upgrade cp3035 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/404990 (https://phabricator.wikimedia.org/T180433) [14:22:18] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade cp3035 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/404990 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [14:22:46] _joe_: ok so save hhvm-dump-debug and systemctl restart hhvm [14:23:10] <_joe_> godog: yes [14:23:41] (03Merged) 10jenkins-bot: db-codfw.php: Promote db2043 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404976 (owner: 10Marostegui) [14:23:44] there is also restart-hhvm to properly depool no? 
[14:23:44] (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444) (owner: 10Herron) [14:23:46] who can swat today? [14:23:51] !log cache_upload: upgrade cp3035 to varnish 5 [14:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:14] godog: let me know if you need help, I was about to check [14:24:31] PROBLEM - puppet last run on mw1342 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm] [14:24:51] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404991 [14:24:54] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404991 [14:25:06] elukey: yup will do [14:25:15] !log restart hhvm on mw1227 [14:25:18] strange, still not logged on the binlog, but the row is present [14:25:20] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Promote db2043 to s3 master after db2036 crash (duration: 01m 12s) [14:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:33] jynus: mediawiki-config deployed [14:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:51] <_joe_> godog: I'll take a look at the rest of the cluster [14:26:14] jynus: sync binlog and trx commit maybe was changed on the live config? [14:26:22] <_joe_> godog: uhm wait, this is pretty strange [14:26:34] As a left over from something and was not 1 and 1? [14:26:48] <_joe_> uhm actually no, they do need restarts I'd say [14:26:58] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404991 (owner: 10Marostegui) [14:27:03] so I start replication, ok? 
[14:27:06] yep [14:27:08] (03CR) 10jenkins-bot: db-codfw.php: Promote db2043 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404976 (owner: 10Marostegui) [14:27:12] _joe_: so mw1226 alarmed but looks like it recovered by itself [14:27:19] let's see if it complains [14:27:20] <_joe_> godog: let me take a look [14:27:24] _joe_: though mw1227 didn't and I restarted hhvm there [14:27:39] <_joe_> godog: there is a very high load on the api cluster [14:27:50] RECOVERY - MariaDB Slave IO: s3 on db2043 is OK: OK slave_io_state Slave_IO_Running: Yes [14:27:55] it doesn't [14:28:14] this is some voodoo magic due to mic of row and statement? [14:28:19] *mix [14:28:22] SWAT: All is Right? [14:28:33] !log cache_upload: repool cp3035 (varnish 5) [14:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:43] jynus: but it should still appear on the binlog [14:28:58] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404991 (owner: 10Marostegui) [14:28:59] maybe it is some strange cache- I will search it [14:29:10] maybe something related to parallel replication && caches [14:29:28] jynus: or as I said, maybe sync_binlog/trx_commit were changed manually for some reason [14:29:34] don't know, it is weird [14:29:48] the missing insert appeared [14:30:05] ????? 
[14:30:22] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1087 - T174569 (duration: 01m 12s) [14:30:32] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404991 (owner: 10Marostegui) [14:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:35] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [14:30:38] REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_master_log_pos, shard, datacenter) VALUES ('2018-01-18T14:27:26.000930' [14:30:40] and later [14:30:51] (03PS1) 10Marostegui: db2036: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/404992 [14:30:56] <_joe_> HPHP::jit::enterTCImpl seems to block quite a few threads [14:30:58] #180118 12:56:15 server id 171966669 [14:30:58] Hello SWAT? [14:31:11] jynus: :| [14:31:11] INSERT /* Wikimedia\Rdbms\DatabaseMysqlBase::upsert */ INTO `module_deps` (md_module,md_skin,md_deps) VALUES ( [14:31:16] <_joe_> !log restarting hhvm on a few API appservers [14:31:18] cache then? [14:31:21] so parallel weirdness [14:31:24] <_joe_> godog: there are a few more to restart [14:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:34] making binlog inconsistent with state [14:31:35] (03CR) 10Marostegui: [C: 032] db2036: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/404992 (owner: 10Marostegui) [14:31:41] even after reboot! [14:31:45] Zppix: ? [14:31:50] because this is on the new binlog after reboot [14:32:15] 10Puppet, 10Analytics-Kanban, 10User-Elukey: analytics VPS project puppet errors - https://phabricator.wikimedia.org/T184482#3909685 (10elukey) Fixed all except j1.analytics.eqiad.wmflabs - @Ottomata do we still need this? It seems running superset, and puppet is broken in there.. [14:32:19] This is so weird... 
[14:32:22] And scary [14:32:24] I do not like this- I would be ok with the cache [14:32:25] Jayprakash12345: yes? [14:32:37] but not in cold [14:32:40] Zppix: https://wikitech.wikimedia.org/wiki/Deployments#Wednesday,_January_17 [14:33:27] Jayprakash12345: what about it? [14:33:31] on the other side, they have different domain ids [14:33:44] so we do not guarantee consistency [14:34:00] because different serverid == different domain [14:34:23] _joe_: ack, all of those in warning in icinga? [14:34:26] Zppix:Are you SWAT member??? [14:34:29] Jayprakash12345: you realize i have no prod access right... and today is the 18th [14:34:30] <_joe_> godog: at least [14:34:46] <_joe_> godog: I'm doing all servers up to mw1230 btw [14:35:02] Jayprakash12345: aka im not a deployer [14:35:12] Zppix: Sorry [14:35:23] Jayprakash12345: no worries :P [14:35:25] <_joe_> godog: use "restart-hhvm" btw [14:35:38] Jayprakash12345: regardless the next swat window is not for a while [14:35:40] <_joe_> it depools the host, restarts hhvm, then it pools it back [14:36:13] zeljkof: Can you merge https://gerrit.wikimedia.org/r/#/c/404067/ [14:36:42] (03CR) 10Steinsplitter: [C: 031] "looks ok" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404148 (https://phabricator.wikimedia.org/T184865) (owner: 10Biplab Anand) [14:36:59] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 9.15, 15.20, 23.20 [14:37:00] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 8.95, 13.50, 23.78 [14:37:19] Jayprakash12345: Is there a reason why you specifically ask zeljkof ? 
[14:37:43] (03PS1) 10Ema: cache_upload: upgrade cp3038 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/404993 (https://phabricator.wikimedia.org/T180433) [14:37:46] Because He is SWAT member [14:38:02] _joe_: sounds good, I'll do on mw1233 [14:39:12] !log restart hhvm on mw1233 [14:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:29] RECOVERY - puppet last run on mw1342 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:39:39] (03CR) 10Ema: [C: 032] cache_upload: upgrade cp3038 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/404993 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [14:39:54] !log cache_upload: upgrade cp3038 to varnish 5 [14:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:16] <_joe_> the cpu utilization curve of hhvm after a restart will never ever cease to amaze me. for the first couple minutes, it's worse than php5 [14:40:44] _joe_: JIT? [14:40:58] andre__: it's SWAT time :) [14:41:01] <_joe_> jynus: that and re-loading of all files into the repo [14:41:29] <_joe_> they need to validate the files when starting, but yeah it's mostly jit and caching [14:42:05] maybe we could "pre-compile" some or most of the code by executing reads on repool? 
[14:42:30] <_joe_> we did do something like that [14:42:33] (03PS4) 10Alexandros Kosiaris: Move role::grafana::base to profile::grafana [puppet] - 10https://gerrit.wikimedia.org/r/404308 (https://phabricator.wikimedia.org/T170150) [14:42:35] (03PS6) 10Alexandros Kosiaris: Remove role::grafana::labs [puppet] - 10https://gerrit.wikimedia.org/r/404314 (https://phabricator.wikimedia.org/T170150) [14:42:37] (03PS3) 10Alexandros Kosiaris: Simplify profile::grafana::production [puppet] - 10https://gerrit.wikimedia.org/r/404319 (https://phabricator.wikimedia.org/T170150) [14:42:40] (03PS3) 10Alexandros Kosiaris: grafana: Allow to modify the config in hiera [puppet] - 10https://gerrit.wikimedia.org/r/404320 (https://phabricator.wikimedia.org/T170150) [14:42:41] (03PS3) 10Alexandros Kosiaris: grafana: Add migration script from proxy to LDAP auth [puppet] - 10https://gerrit.wikimedia.org/r/404651 (https://phabricator.wikimedia.org/T170150) [14:42:43] <_joe_> but the effects weren't good enough to justify investing in it [14:42:44] (03PS7) 10Alexandros Kosiaris: grafana: Enable grafana LDAP in production [puppet] - 10https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150) [14:43:21] (03CR) 10Alexandros Kosiaris: [C: 032] "Noop per https://puppet-compiler.wmflabs.org/compiler02/9782/krypton.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/404320 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris) [14:43:39] (03CR) 10Alexandros Kosiaris: [C: 032] "Noop per https://puppet-compiler.wmflabs.org/compiler02/9782/" [puppet] - 10https://gerrit.wikimedia.org/r/404319 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris) [14:43:46] (03CR) 10Alexandros Kosiaris: [C: 032] "Noop per https://puppet-compiler.wmflabs.org/compiler02/9782/krypton.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/404314 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris) [14:43:56] (03CR) 10Alexandros 
Kosiaris: [C: 032] "Noop per https://puppet-compiler.wmflabs.org/compiler02/9782/krypton.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/404308 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris) [14:44:18] !log disabling puppet agents during deploy of 404587, 404689 [14:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:10] !log cache_upload: repool cp3038 (varnish 5) [14:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:45] (03CR) 10Herron: [C: 032] puppetmaster::ssl: fix crl file suffix [puppet] - 10https://gerrit.wikimedia.org/r/404587 (https://phabricator.wikimedia.org/T184444) (owner: 10Herron) [14:45:50] (03PS2) 10Herron: puppetmaster::ssl: fix crl file suffix [puppet] - 10https://gerrit.wikimedia.org/r/404587 (https://phabricator.wikimedia.org/T184444) [14:46:26] (03PS3) 10Herron: add support for SSLCARevocationCheck setting in puppetmaster frontend [puppet] - 10https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444) [14:48:42] (03CR) 10Herron: [C: 032] add support for SSLCARevocationCheck setting in puppetmaster frontend [puppet] - 10https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444) (owner: 10Herron) [14:49:06] (03PS4) 10Herron: add support for SSLCARevocationCheck setting in puppetmaster frontend [puppet] - 10https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444) [14:51:07] (03CR) 10Rush: [C: 031] "we all talked it over, seems like a reasonable idea" [puppet] - 10https://gerrit.wikimedia.org/r/403621 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [14:53:41] (03PS1) 10Marostegui: db-eqiad.php Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404995 (https://phabricator.wikimedia.org/T162807) [14:54:19] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:54:42] ^ that’s me [14:56:59] (03CR) 10Marostegui: [C: 032] db-eqiad.php Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404995 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [14:58:37] !log reprepro includedeb jessie-wikimedia python-requests-mock_1.3.0-3_all.deb [14:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:51] !log installing bind security updates (we only use the client-side tools) [14:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:06] (03Merged) 10jenkins-bot: db-eqiad.php Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404995 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [14:59:16] (03CR) 10jenkins-bot: db-eqiad.php Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404995 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [14:59:49] (03CR) 10Volans: [C: 032] PuppetDB backend: add support for API v4 [software/cumin] - 10https://gerrit.wikimedia.org/r/399821 (https://phabricator.wikimedia.org/T182575) (owner: 10Volans) [15:00:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1066 - T162807 (duration: 01m 11s) [15:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:10] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [15:01:43] !log Stop replication in sync db1089 and db1066 - T162807 [15:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:20] (03Merged) 10jenkins-bot: PuppetDB backend: add support for API v4 [software/cumin] - 10https://gerrit.wikimedia.org/r/399821 (https://phabricator.wikimedia.org/T182575) (owner: 10Volans) [15:05:35] (03CR) 10jenkins-bot: PuppetDB backend: add support for API v4 [software/cumin] - 10https://gerrit.wikimedia.org/r/399821 
(https://phabricator.wikimedia.org/T182575) (owner: 10Volans) [15:07:00] PROBLEM - DPKG on restbase2005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:08:09] RECOVERY - Keyholder SSH agent on labpuppetmaster1002 is OK: OK: Keyholder is armed with all configured keys. [15:10:09] RECOVERY - DPKG on restbase2005 is OK: All packages OK [15:10:10] PROBLEM - puppet last run on wtp1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:10:40] PROBLEM - puppet last run on mw1343 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm] [15:11:12] !log mforns@tin Started deploy [analytics/refinery@78f98d9]: deploying refinery to add ISO codes to pageviews by country [15:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:00] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 54.93 seconds [15:12:10] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 31.23 seconds [15:12:10] RECOVERY - MariaDB Slave Lag: s3 on db2074 is OK: OK slave_sql_lag Replication lag: 29.62 seconds [15:12:10] RECOVERY - MariaDB Slave Lag: s3 on db2018 is OK: OK slave_sql_lag Replication lag: 29.63 seconds [15:12:40] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [15:12:53] I will reimage db2036 with db2018 data [15:14:28] (03CR) 10Andrew Bogott: [C: 031] "I think that if something on a WMCS VM is logging to production logstash then that's already wrong. 
So this seems fine with me unless you" [puppet] - 10https://gerrit.wikimedia.org/r/404980 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [15:14:58] (03PS1) 10Marostegui: Revert "db-eqiad.php Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404997 [15:15:03] (03PS1) 10Volans: Copyright notice: add 2018 [software/cumin] - 10https://gerrit.wikimedia.org/r/404998 [15:15:25] !log mforns@tin Finished deploy [analytics/refinery@78f98d9]: deploying refinery to add ISO codes to pageviews by country (duration: 04m 12s) [15:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:22] jynus: ok! [15:16:40] PROBLEM - etc request latencies on neon is CRITICAL: CRITICAL - etcd_request_latencies is 58251 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:17:10] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404997 (owner: 10Marostegui) [15:17:12] akosiaris: FYI typo, missing 'd' in etcd :-P ^^^ [15:18:03] (03CR) 10Volans: [C: 032] Copyright notice: add 2018 [software/cumin] - 10https://gerrit.wikimedia.org/r/404998 (owner: 10Volans) [15:18:40] RECOVERY - etc request latencies on neon is OK: OK - etcd_request_latencies is 3791 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:18:42] (03PS1) 10Alexandros Kosiaris: grafana: Break dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/405000 [15:18:46] volans: lol [15:18:49] 10Operations: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845#3909792 (10fgiunchedi) I tried manually setting `rootdelay` from within grub on the first boot of restbase2007 and 2008 with 5 and 3 respectively, both worked as in the raid arrays were assembled [15:19:14] (03Merged) 10jenkins-bot: Revert "db-eqiad.php Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404997 (owner: 10Marostegui) [15:19:25] (03CR) 10jenkins-bot: Revert 
"db-eqiad.php Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404997 (owner: 10Marostegui) [15:20:26] (03Merged) 10jenkins-bot: Copyright notice: add 2018 [software/cumin] - 10https://gerrit.wikimedia.org/r/404998 (owner: 10Volans) [15:20:40] (03PS1) 10Alexandros Kosiaris: Fix etcd request latencies alert typo [puppet] - 10https://gerrit.wikimedia.org/r/405001 [15:20:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1066 - T162807 (duration: 01m 12s) [15:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:53] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [15:21:10] PROBLEM - puppet last run on mw1304 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:21:26] (03CR) 10Alexandros Kosiaris: [C: 032] grafana: Break dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/405000 (owner: 10Alexandros Kosiaris) [15:21:30] (03CR) 10Alexandros Kosiaris: [C: 032] Fix etcd request latencies alert typo [puppet] - 10https://gerrit.wikimedia.org/r/405001 (owner: 10Alexandros Kosiaris) [15:21:38] (03CR) 10jenkins-bot: Copyright notice: add 2018 [software/cumin] - 10https://gerrit.wikimedia.org/r/404998 (owner: 10Volans) [15:22:25] 10Operations, 10ops-eqiad: check americium eth1 cabling and link - https://phabricator.wikimedia.org/T185219#3909801 (10Jgreen) [15:22:40] PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:23:10] PROBLEM - puppet last run on mc1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[dnsutils] [15:30:43] (03PS1) 10Ema: cache_upload: upgrade esams to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/405007 (https://phabricator.wikimedia.org/T180433) [15:33:29] (03CR) 10Ema: [C: 032] cache_upload: upgrade esams to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/405007 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [15:33:51] !log reboot labsdb1006 (OSM slave) for kernel security update [15:34:02] !log cache_upload: upgrade cp3047 to varnish 5 [15:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:17] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:35:18] RECOVERY - puppet last run on wtp1028 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:40:07] !log cache_upload: repool cp3047 (varnish 5) [15:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:58] !log rebooting labsdb1004 for kernel security update [15:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:16] 10Puppet, 10Analytics-Kanban, 10User-Elukey: analytics VPS project puppet errors - https://phabricator.wikimedia.org/T184482#3909876 (10Ottomata) j1 deleted! [15:42:38] 10Puppet, 10Analytics-Kanban, 10User-Elukey: analytics VPS project puppet errors - https://phabricator.wikimedia.org/T184482#3909884 (10elukey) 05Open>03Resolved Closing the task since puppet should be ok now, please re-open otherwise! [15:43:03] moritzm: did you stop mariadb in advance? [15:43:50] jynus: that's postgres, I've synched the reboot with Aaron who's keeping an eye on wikilabels [15:44:00] no, that is postgres and mariadb [15:44:50] can you take care and/or fix replication inconsistencies when it boots? 
[15:45:15] *take care of checking [15:45:28] jynus: no, I wasn't aware of that, sorry. I've always just been following https://wikitech.wikimedia.org/wiki/Service_restarts#labsdb1004_(wikilabels) for this for all recent reboots [15:45:52] that is the postgres section [15:46:04] which obviously doesn't mention mariadb [15:46:15] RECOVERY - puppet last run on mw1304 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:46:51] please take care of that or ping cloud so they can do it [15:47:42] jynus: sorry, I was unaware of the mariadb part, will amend the docs. will ping the WMCS channel [15:47:58] the docs are ok [15:48:07] they tell you how to shutdown postgres [15:48:16] what they don't tell is what servers are where [15:48:23] RECOVERY - puppet last run on mc1030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:48:24] *services are on which servers [15:48:24] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:13] (03CR) 10Ppchelko: "Why bother people with SWAT? 
It's just a cleanup, it can wait until we have something substantial" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404888 (owner: 10Ppchelko) [15:50:16] yes, as I have mentioned I was unaware of mariadb being on that host, I'll add a reminder do the labsdb1004 docs so that this doesn't happen again [15:50:43] RECOVERY - puppet last run on mw1343 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:51:20] (03PS1) 10Filippo Giunchedi: install_server: reprovision restbase200[789] [puppet] - 10https://gerrit.wikimedia.org/r/405008 (https://phabricator.wikimedia.org/T184100) [15:52:43] RECOVERY - puppet last run on es2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:53:23] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:53:48] (03PS2) 10Andrew Bogott: puppet.conf: replace configtimeout [puppet] - 10https://gerrit.wikimedia.org/r/404480 (https://phabricator.wikimedia.org/T182585) (owner: 10Herron) [15:55:19] 10Operations, 10fundraising-tech-ops, 10netops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3909954 (10Jgreen) [15:55:29] (03CR) 10Andrew Bogott: [C: 032] puppet.conf: replace configtimeout [puppet] - 10https://gerrit.wikimedia.org/r/404480 (https://phabricator.wikimedia.org/T182585) (owner: 10Herron) [16:01:40] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3909975 (10Anomie) >>! In T133410#3907781, @Iniquity wrote: > @Isarra we are waiting for T180817 this task. Regarding that, see T180817#... [16:02:10] 10Operations, 10Puppet: Update puppetmaster1001 puppet certificate - https://phabricator.wikimedia.org/T180167#3909979 (10herron) 05Open>03Resolved a:03herron done! 
[16:02:14] 10Operations, 10Puppet, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3909982 (10herron) [16:04:02] (03CR) 10ArielGlenn: "This function is only called if the config file doesn't have entries in the [database] section for 'user' and 'password'. It's specificall" [dumps] - 10https://gerrit.wikimedia.org/r/404986 (https://phabricator.wikimedia.org/T185116) (owner: 10Awight) [16:04:20] (03CR) 10ArielGlenn: [C: 031] "LGTM could merge whenever you want." [dumps] - 10https://gerrit.wikimedia.org/r/404987 (https://phabricator.wikimedia.org/T185116) (owner: 10Awight) [16:09:00] 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3910002 (10Ottomata) [16:09:02] 10Puppet, 10Analytics, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-kafka03 due to full disk - https://phabricator.wikimedia.org/T184235#3909999 (10Ottomata) 05Open>03Resolved deployment-kafka03 has been deleted. [16:13:00] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 4 minutes ago with 5 failures. Failed resources (up to 3 shown): Service[nagios-nrpe-server],Exec[ip addr add 2620:0:860:102:10:192:16:139/64 dev eth0],Exec[absent_ensure_members],Exec[ops_ensure_members] [16:14:43] (03CR) 10Awight: "Thanks for noticing! I must have changed the config file at the same time as the code..." 
[dumps] - 10https://gerrit.wikimedia.org/r/404986 (https://phabricator.wikimedia.org/T185116) (owner: 10Awight) [16:14:48] (03Abandoned) 10Awight: Use a global mysql user if configured [dumps] - 10https://gerrit.wikimedia.org/r/404986 (https://phabricator.wikimedia.org/T185116) (owner: 10Awight) [16:15:38] (03PS3) 10Awight: Tolerate empty tables [dumps] - 10https://gerrit.wikimedia.org/r/404987 (https://phabricator.wikimedia.org/T185116) [16:18:42] (03PS2) 10Filippo Giunchedi: install_server: reprovision restbase200[789] [puppet] - 10https://gerrit.wikimedia.org/r/405008 (https://phabricator.wikimedia.org/T184100) [16:20:13] (03CR) 10Filippo Giunchedi: [C: 032] install_server: reprovision restbase200[789] [puppet] - 10https://gerrit.wikimedia.org/r/405008 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [16:21:39] (03CR) 10BryanDavis: [C: 031] "I agree with Andrew. If this is being picked up in a Cloud VPS host then the data is routing to the wrong place already and should be fixe" [puppet] - 10https://gerrit.wikimedia.org/r/404980 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [16:28:19] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3910062 (10jcrespo) [16:29:58] (03PS1) 10Cmjohnson: Removing mgmt dns entries for decom host erbium [dns] - 10https://gerrit.wikimedia.org/r/405012 (https://phabricator.wikimedia.org/T185226) [16:33:11] PROBLEM - Check size of conntrack table on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:33:11] PROBLEM - Check systemd state on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:33:21] PROBLEM - Disk space on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:33:50] PROBLEM - configured eth on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[16:33:51] PROBLEM - dhclient process on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:34:00] PROBLEM - DPKG on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:34:01] PROBLEM - Check whether ferm is active by checking the default input chain on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:35:50] RECOVERY - configured eth on pybal-test2001 is OK: OK - interfaces up [16:35:51] RECOVERY - dhclient process on pybal-test2001 is OK: PROCS OK: 0 processes with command name dhclient [16:36:00] RECOVERY - DPKG on pybal-test2001 is OK: All packages OK [16:36:01] RECOVERY - Check whether ferm is active by checking the default input chain on pybal-test2001 is OK: OK ferm input default policy is set [16:36:11] RECOVERY - Check size of conntrack table on pybal-test2001 is OK: OK: nf_conntrack is 0 % full [16:36:20] RECOVERY - Check systemd state on pybal-test2001 is OK: OK - running: The system is fully operational [16:36:21] RECOVERY - Disk space on pybal-test2001 is OK: DISK OK [16:41:37] (03PS1) 10Ottomata: Use jumbo Kafka for EventStreams in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/405014 (https://phabricator.wikimedia.org/T185225) [16:43:01] RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:53:17] (03PS2) 10Ottomata: Use jumbo Kafka for EventStreams in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/405014 (https://phabricator.wikimedia.org/T185225) [16:54:18] (03CR) 10Ottomata: [C: 032] Use jumbo Kafka for EventStreams in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/405014 (https://phabricator.wikimedia.org/T185225) (owner: 10Ottomata) [16:54:21] (03CR) 10Ottomata: [C: 032] "No op https://puppet-compiler.wmflabs.org/compiler02/9785/" [puppet] - 10https://gerrit.wikimedia.org/r/405014 (https://phabricator.wikimedia.org/T185225) (owner: 
10Ottomata) [16:55:53] !log cache_upload: upgrade cp3036 to varnish 5 [16:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:39] hey, anyone here messing with deploymenet-puppetmaster02? there are local changes to puppet repo there [16:59:41] RECOVERY - puppet last run on ms-be2023 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:00:04] godog, moritzm, and _joe_: That opportune time is upon us again. Time for a Puppet SWAT(Max 8 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180118T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:01:02] i'm stashing them [17:02:12] (03CR) 10Andrew Bogott: [C: 031] hieradata: extend SMART eqiad deployment [puppet] - 10https://gerrit.wikimedia.org/r/403621 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [17:02:19] !log cache_upload: repool cp3036 (varnish 5) [17:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:22] (03PS1) 10Dduvall: Include scaffold for service-checker helm tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/405016 [17:04:01] (03PS2) 10Dduvall: Include scaffold for service-checker helm tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/405016 [17:05:33] If you see mgmt icing errors for the next 20 mins or so that is me...all servers will be in c8 [17:08:10] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [17:11:07] (03PS3) 10Dduvall: Include scaffold for service-checker helm tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/405016 [17:11:40] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3910184 (10RobH) p:05Triage>03High [17:12:04] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3910199 (10RobH) Since this involves 
#traffic as well as #netops, this plan should get @bblack's review/approval. [17:12:32] 10Operations, 10ops-eqiad, 10netops: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#3910200 (10ayounsi) [17:14:13] 10Operations, 10ops-eqiad, 10netops: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#3908273 (10ayounsi) Port mapping is correct. Note that the "fe-" ports will be renamed "ge-", but their physical location is unchanged. [17:14:17] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3910209 (10RobH) [17:18:44] (03PS1) 10Filippo Giunchedi: install_server: swap restbase2009 disk layout [puppet] - 10https://gerrit.wikimedia.org/r/405019 (https://phabricator.wikimedia.org/T184100) [17:20:01] (03PS2) 10Filippo Giunchedi: install_server: swap restbase2009 disk layout [puppet] - 10https://gerrit.wikimedia.org/r/405019 (https://phabricator.wikimedia.org/T184100) [17:20:58] (03CR) 10Filippo Giunchedi: [C: 032] install_server: swap restbase2009 disk layout [puppet] - 10https://gerrit.wikimedia.org/r/405019 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [17:21:00] 10Operations, 10ops-ulsfo, 10Traffic: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815#3703799 (10Volans) I've run clean + deactivate for cp4018 as part of cleanup of stale puppet certs. 
[17:26:06] !log cache_upload: upgrade cp3039 to varnish 5 [17:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:21] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3910263 (10ayounsi) [17:28:10] (03CR) 10Mobrovac: [C: 04-1] cassandra: create parent data directories with exec (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) (owner: 10Eevans) [17:29:36] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T184787#3910280 (10fgiunchedi) 05Open>03Resolved Thanks @Papaul ! Disk rebuilding [17:29:53] (03PS2) 10Gehel: mediawiki: remove unused logging configuration of mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/404980 (https://phabricator.wikimedia.org/T182304) [17:29:55] (03PS1) 10Ottomata: Add druid defaults for easier setup in Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/405021 [17:31:39] !log cache_upload: repool cp3039 (varnish 5) [17:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:09] (03CR) 10Gehel: [C: 032] mediawiki: remove unused logging configuration of mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/404980 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [17:33:45] !log starting compare.py on s3 codfw (it triggered db2036 crash before) [17:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:41] (03CR) 10Ottomata: [C: 032] Add druid defaults for easier setup in Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/405021 (owner: 10Ottomata) [17:35:46] (03PS2) 10Ottomata: Add druid defaults for easier setup in Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/405021 [17:36:50] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9787/" [puppet] - 10https://gerrit.wikimedia.org/r/405021 (owner: 10Ottomata) [17:39:40] 
!log rebooting sodium (and temporarily disable icinga-wm due to some expected spam due to clients failing to run apt-get update) [17:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:49] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: Cleanup multiple definitions of logstash endpoint in puppet / hiera - https://phabricator.wikimedia.org/T182304#3910343 (10Gehel) Most references to logstash host are now consolidated in a single variable. There are... [17:44:47] sodium back up and ircecho re-started [17:46:05] 10Operations, 10Puppet, 10Patch-For-Review: Puppet hosts with their cert revoked can still run puppet - https://phabricator.wikimedia.org/T184444#3910365 (10herron) CRL is now being checked by the puppet master apache frontends. There were a few issues encountered after deployment # puppetmaster1001 cert... [17:47:03] moritzm: I guess we'll still get some recovery spam in ~30 min, right? [17:48:19] volans: I expected a load of puppet failures, but I'm not seeing any in "Current Network Status" in icinga, seems to be handled more gracefully than anticipated [17:48:33] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3910381 (10RStallman-legalteam) Sorry I missed this ping through phabricator. Christian Covington's NDA is signed and on file and I'm updating the NDA spreadshe... [17:49:16] good :) [17:49:52] (03PS1) 10Chad: Update my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/405025 [17:51:31] 10Operations, 10Performance-Team, 10Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3910404 (10Imarlier) We should look at the varnish logs to see if we can find other pages that have a similar behavior. 
Is this part of T164248 [17:54:59] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3910411 (10RobH) [17:55:27] !log cache_upload: upgrade cp3044 to varnish 5 [17:55:29] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 346 MB (3% inode=61%) [17:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:02] (03PS2) 10RobH: adding shell user cy534 [puppet] - 10https://gerrit.wikimedia.org/r/403088 (https://phabricator.wikimedia.org/T184473) [17:58:05] (03CR) 10RobH: [C: 032] adding shell user cy534 [puppet] - 10https://gerrit.wikimedia.org/r/403088 (https://phabricator.wikimedia.org/T184473) (owner: 10RobH) [17:58:25] 10Operations, 10MediaWiki-JobQueue, 10Beta-Cluster-reproducible, 10Performance-Team (Radar): Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3910419 (10Imarlier) [17:59:04] 10Operations, 10Beta-Cluster-Infrastructure, 10MediaWiki-JobQueue, 10Beta-Cluster-reproducible, 10Performance-Team (Radar): Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3904733 (10Imarlier) `monospaced text` [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180118T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:17] !log cache_upload: repool cp3044 (varnish 5) [18:00:26] (03PS2) 10RobH: adds cy534 to analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/403090 (https://phabricator.wikimedia.org/T184473) [18:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:36] ORES isn’t scheduled to cause surprise downtime today. 
[18:01:06] (03PS1) 10Muehlenhoff: Ensure all packages are updated when d-i installs security updates [puppet] - 10https://gerrit.wikimedia.org/r/405026 [18:02:17] 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910431 (10Reedy) [18:02:42] (03Draft2) 10Jayprakash12345: Add "Portal" namespace on it.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405024 (https://phabricator.wikimedia.org/T185232) [18:04:25] anyone doing services deploy? can I join in for a scap deploy of /srv/3d2png? [18:06:25] (03PS2) 10Chad: Update my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/405025 [18:07:17] 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910454 (10MoritzMuehlenhoff) Ops and Releng are using pwstore, which is just using a simple git repository underneath for storage. The repo for Ops is stored on a private host and IIRC R... [18:11:57] 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910431 (10demon) We did use a private repo in Phabricator. It's worked great. [18:12:38] alright - would I be disrupting anyone/-thing if I were to scap deploy /srv/3d2png now (or any other reason not to do so - first time updating service so I don't know the process for these things all too well :)) [18:13:09] 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910492 (10Reedy) Something like that sounds fine, yeah. Be nice to do something that isn't wheel re-inventing obviously, so if we can follow what releng do... That works for me :) [18:14:25] (03PS3) 10Chad: Update my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/405025 [18:17:03] greg-g ^ - would now be a good time to update 3d2png service? (or is there some process I should go through) [18:17:32] matthiasmullie: yup! 
[18:17:59] matthiasmullie: just !log what you're doing :) (scap does it for you if you give it a message) [18:18:42] 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910505 (10MoritzMuehlenhoff) Some general docs (targeted at ops pwstore, but pretty similar) are at https://office.wikimedia.org/wiki/Pwstore [18:19:02] alright thanks [18:22:02] (03PS4) 10RobH: Update my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/405025 (owner: 10Chad) [18:22:10] (03PS1) 10Sbisson: Hide Flow beta feature everywhere but testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405034 (https://phabricator.wikimedia.org/T184670) [18:22:25] !log mlitn@tin Started deploy [3d2png/deploy@74b1ed7]: Updating 3d2png repo [18:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:15] !log mlitn@tin Finished deploy [3d2png/deploy@74b1ed7]: Updating 3d2png repo (duration: 00m 50s) [18:23:22] 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910543 (10demon) Our process in Releng is identical, save the "where to clone from" bit. [18:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:46] (03CR) 10RobH: [C: 032] Update my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/405025 (owner: 10Chad) [18:25:04] 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910431 (10greg) Our repo (https://phabricator.wikimedia.org/source/releng-secrets/repository/master/ ) haha you can't see it! ;) but yeah, it's straight-forward to make private repos in Ph... [18:28:53] 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910552 (10Reedy) Cool. 
I'll check in with Brian/John that they're ok with doing it like this before we go ahead and actually create anything [18:31:32] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/demon/.pylintrc] [18:33:20] 10Operations, 10Analytics-Kanban, 10Traffic, 10User-Elukey: Refactor kafka_config.rb and and kafka_cluster_name.rb in puppet to avoid explicit hiera calls - https://phabricator.wikimedia.org/T177927#3910563 (10fdans) [18:33:22] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 43.51, 36.76, 32.84 [18:33:28] greg-g, I'm doing an emergency deploy for https://phabricator.wikimedia.org/T184670 (SD opt-in/opt-out breakage). We may also have to run a maintenance script later. CC stephanebisson [18:33:54] !log cache_upload: upgrade cp3045 to varnish 5 [18:34:01] greg-g, first deployed change just disables the Beta Feature on non-talk. Existing users will keep their current discussion page, but they won't be able to turn on or off. [18:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:27] 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910565 (10Reedy) p:05Triage>03Normal [18:35:00] matt_flaschen: yuck, godspeed. 
[18:35:04] cc thcipriani ^ [18:35:35] ack [18:37:36] 10Operations, 10Analytics, 10Traffic, 10User-Elukey: Refactor kafka_config.rb and and kafka_cluster_name.rb in puppet to avoid explicit hiera calls - https://phabricator.wikimedia.org/T177927#3910569 (10fdans) [18:39:12] !log cache_upload: repool cp3045 (varnish 5) [18:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:32] !log bsitzmann@tin Started deploy [mobileapps/deploy@669fb5b]: Update mobileapps to 2690899 (T184328 T184557 T177007 T184669 T177430 T185050) [18:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:48] T184328: Add revision + tid to references output - https://phabricator.wikimedia.org/T184328 [18:40:48] T177007: Should we flatten spans in summary output - https://phabricator.wikimedia.org/T177007 [18:40:49] T177430: Develop a Media JSON API - https://phabricator.wikimedia.org/T177430 [18:40:49] T184669: Fetching incorrect description data from the Feed API - https://phabricator.wikimedia.org/T184669 [18:40:49] T185050: Run comparison of html extracts again - https://phabricator.wikimedia.org/T185050 [18:41:09] 10Operations, 10Analytics-Kanban, 10User-Elukey: Tune Varnishkafka delivery errors to be more sensitive - https://phabricator.wikimedia.org/T173492#3910612 (10fdans) [18:41:11] 10Operations, 10Puppet: Puppet hosts with signed certificate present on agent but not master - https://phabricator.wikimedia.org/T185239#3910613 (10herron) p:05Triage>03Normal [18:42:02] (03CR) 10Mattflaschen: [C: 032] Hide Flow beta feature everywhere but testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405034 (https://phabricator.wikimedia.org/T184670) (owner: 10Sbisson) [18:44:11] (03Merged) 10jenkins-bot: Hide Flow beta feature everywhere but testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405034 (https://phabricator.wikimedia.org/T184670) (owner: 10Sbisson) [18:44:44] 10Operations, 10Puppet: Puppet hosts 
with signed certificate present on agent but not master - https://phabricator.wikimedia.org/T185239#3910632 (10herron) [18:45:29] 10Operations, 10procurement: Give access to S4 (procurement tasks) to Deb Tankersley - https://phabricator.wikimedia.org/T185240#3910634 (10Gehel) [18:47:07] (03CR) 10jenkins-bot: Hide Flow beta feature everywhere but testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405034 (https://phabricator.wikimedia.org/T184670) (owner: 10Sbisson) [18:47:35] !log bsitzmann@tin Finished deploy [mobileapps/deploy@669fb5b]: Update mobileapps to 2690899 (T184328 T184557 T177007 T184669 T177430 T185050) (duration: 07m 03s) [18:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:52] T184328: Add revision + tid to references output - https://phabricator.wikimedia.org/T184328 [18:47:52] T177007: Should we flatten spans in summary output - https://phabricator.wikimedia.org/T177007 [18:47:52] T177430: Develop a Media JSON API - https://phabricator.wikimedia.org/T177430 [18:47:53] T184669: Fetching incorrect description data from the Feed API - https://phabricator.wikimedia.org/T184669 [18:47:53] T185050: Run comparison of html extracts again - https://phabricator.wikimedia.org/T185050 [18:49:11] 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910664 (10Bawolff) Do we actually have shared secrets? Im fine with the private repo approach (the repo is secret and then pgp on top of that, right?) [18:51:49] 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910667 (10Reedy) >>! In T185236#3910664, @Bawolff wrote: > Do we actually have shared secrets? Not really at the moment, but I'd imagine there's stuff other teams have that we should prob... 
[18:52:47] 10Operations, 10Analytics-Kanban, 10monitoring, 10netops, 10User-Elukey: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3910670 (10elukey) Had an interesting chat with @ayounsi and for the moment it seems that the only format expected in the ne... [18:54:44] !log cache_upload: upgrade cp3046 to varnish 5 [18:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:32] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [18:58:29] !log mattflaschen@tin Synchronized wmf-config/InitialiseSettings.php: T184670: Hide Flow beta feature everywhere but testwiki (duration: 01m 10s) [18:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:44] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200) [18:58:44] T184670: [wmf.16-regression] Fatal exception of type "Flow\Exception\InvalidDataException" for opting out from "Structured Discussions on user talk" - https://phabricator.wikimedia.org/T184670 [18:58:52] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200) [18:58:52] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200) [18:59:03] 10Operations, 10Patch-For-Review: Decommission host erbium - 
https://phabricator.wikimedia.org/T185226#3910697 (10Aklapper) @Framawiki: The one you can already find in the patch. [18:59:13] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200) [18:59:13] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200) [18:59:23] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200) [19:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180118T1900). [19:00:04] subbu, Trey314159, and edsanders: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:11] !log cache_upload: repool cp3046 (varnish 5) [19:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:13] joal: problems with aqs? [19:02:15] stephanebisson, done and confirmed. It should be fine, but in the unlikely event it makes things worse, morning SWAT is now. Ping me on Hangouts if anything goes wrong. I will try to run the dry-run before our meeting. 
[19:02:30] ema: Thanks for pinging, I wouldn't have noticed :( No real problem, will silent the alarm [19:02:47] alright! :) [19:04:25] I can SWAT [19:04:57] 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236#3910718 (10greg) >>! In T185236#3910664, @Bawolff wrote: > Im fine with the private repo approach (the repo is secret and then pgp on top of that, right?) Right. [19:05:12] subbu|lunch: Trey314159 edsanders ping for SWAT if you all are around. [19:05:26] thcipriani: ack [19:05:52] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy [19:05:52] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy [19:05:53] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy [19:05:58] ema: Done --^ [19:06:04] ema: Thanks again ! [19:06:13] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy [19:06:13] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy [19:06:23] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy [19:06:48] ack [19:07:36] 10Operations, 10Puppet, 10Traffic: Puppet hosts with signed certificate present on agent but not master - https://phabricator.wikimedia.org/T185239#3910723 (10ema) [19:08:00] !log arlolra@tin Started deploy [parsoid/deploy@fcc2b63]: Updating Parsoid to af06386 [19:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:18] (03CR) 10RobH: [C: 032] adds cy534 to analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/403090 (https://phabricator.wikimedia.org/T184473) (owner: 10RobH) [19:08:23] (03PS3) 10RobH: adds cy534 to analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/403090 (https://phabricator.wikimedia.org/T184473) [19:08:56] Hmm, cobalt isn't responding on :22 to me :\ [19:10:32] PROBLEM - Disk space on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[19:10:33] PROBLEM - swift-container-auditor on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:10:33] PROBLEM - swift-account-reaper on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:10:42] PROBLEM - DPKG on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:10:50] Nvm, user error [19:10:53] PROBLEM - swift-account-replicator on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:10:54] PROBLEM - swift-container-server on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:11:02] PROBLEM - swift-account-server on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:11:02] (03PS2) 10Thcipriani: Update linter stats for commonswiki less frequently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404233 (https://phabricator.wikimedia.org/T184280) (owner: 10Subramanya Sastry) [19:11:07] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404233 (https://phabricator.wikimedia.org/T184280) (owner: 10Subramanya Sastry) [19:11:13] PROBLEM - configured eth on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:11:22] PROBLEM - swift-object-updater on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:11:23] PROBLEM - swift-container-updater on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:11:32] PROBLEM - swift-object-auditor on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:11:42] PROBLEM - swift-object-server on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:12:12] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[19:12:12] PROBLEM - Check size of conntrack table on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:12:22] PROBLEM - dhclient process on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:13:13] PROBLEM - very high load average likely xfs on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:13:22] PROBLEM - MD RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:13:23] PROBLEM - swift-container-updater on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:13:25] 10Operations, 10Ops-Access-Requests: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3910743 (10RobH) a:05cy534>03None [19:13:50] 10Operations, 10Ops-Access-Requests: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3884118 (10RobH) All access patches merged. It can take up to 30 minutes for affected hosts to receive the update. If you have any questions or issues, please feel free to reopen... [19:13:52] PROBLEM - puppet last run on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:13:57] 10Operations, 10Ops-Access-Requests: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3910748 (10RobH) 05Open>03Resolved a:03RobH [19:14:13] PROBLEM - Check size of conntrack table on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:15:03] PROBLEM - swift-account-auditor on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:15:03] PROBLEM - swift-account-server on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:15:13] PROBLEM - configured eth on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[19:15:22] RECOVERY - swift-object-updater on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [19:15:22] RECOVERY - dhclient process on ms-be2023 is OK: PROCS OK: 0 processes with command name dhclient [19:15:22] RECOVERY - swift-container-updater on ms-be2023 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [19:15:23] RECOVERY - Disk space on ms-be2023 is OK: DISK OK [19:15:23] RECOVERY - swift-object-auditor on ms-be2023 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [19:15:32] RECOVERY - swift-account-reaper on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [19:15:32] RECOVERY - swift-container-auditor on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:15:32] RECOVERY - DPKG on ms-be2023 is OK: All packages OK [19:15:32] RECOVERY - swift-object-server on ms-be2023 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [19:15:52] RECOVERY - swift-container-server on ms-be2023 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [19:15:53] RECOVERY - swift-account-replicator on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [19:15:53] RECOVERY - swift-account-auditor on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [19:15:53] RECOVERY - swift-account-server on ms-be2023 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [19:15:55] (03Merged) 10jenkins-bot: Update linter stats for commonswiki less frequently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404233 (https://phabricator.wikimedia.org/T184280) (owner: 10Subramanya Sastry) [19:16:03] RECOVERY - very high load average 
likely xfs on ms-be2023 is OK: OK - load average: 37.94, 40.40, 36.91 [19:16:03] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2023 is OK: OK ferm input default policy is set [19:16:12] RECOVERY - configured eth on ms-be2023 is OK: OK - interfaces up [19:16:13] RECOVERY - Check size of conntrack table on ms-be2023 is OK: OK: nf_conntrack is 3 % full [19:16:13] RECOVERY - MD RAID on ms-be2023 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [19:16:51] subbu: your change is on mwdebug1002, anything to check there? [19:17:00] not really. [19:17:01] (03CR) 10jenkins-bot: Update linter stats for commonswiki less frequently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404233 (https://phabricator.wikimedia.org/T184280) (owner: 10Subramanya Sastry) [19:17:19] you can proceed. i'll watch db stats in a few hours to see if the failures have gone down. [19:17:32] !log arlolra@tin Finished deploy [parsoid/deploy@fcc2b63]: Updating Parsoid to af06386 (duration: 09m 32s) [19:17:39] subbu: kk, doing. [19:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:06] 10Operations, 10procurement: Give access to S4 (procurement tasks) to Deb Tankersley - https://phabricator.wikimedia.org/T185240#3910766 (10RobH) 05Open>03Resolved For those following along, this is specifically for acl*procurement-review which is a group that I maintain. I've reviewed and added @debt as... [19:18:22] PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[19:18:42] RECOVERY - puppet last run on ms-be2023 is OK: OK: Puppet is currently enabled, last run 14 minutes ago with 0 failures [19:19:42] (03PS9) 10Thcipriani: Updates to enable transliteration for crhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396282 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones) [19:20:06] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:404233|Update linter stats for commonswiki less frequently]] T184280 (duration: 01m 13s) [19:20:13] subbu: ^ live [19:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:17] T184280: Linter multiple database issues - https://phabricator.wikimedia.org/T184280 [19:20:33] ty [19:20:42] !log cache_upload: upgrade cp3049 to varnish 5 [19:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:16] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396282 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones) [19:24:15] !log Updated Parsoid to af06386 (T45094) [19:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:28] T45094: Parsoid: References should be wrapped in a , not a - https://phabricator.wikimedia.org/T45094 [19:24:45] (03CR) 10Dzahn: "thanks for merging, and that ticket. glad it wasn't just me, didn't get why :)" [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn) [19:25:44] 10Operations, 10procurement: Give access to S4 (procurement tasks) to Deb Tankersley - https://phabricator.wikimedia.org/T185240#3910846 (10debt) Thanks for the add, @RobH, and I understand. 
:) [19:25:57] (03Merged) 10jenkins-bot: Updates to enable transliteration for crhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396282 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones) [19:26:32] Trey314159: ^ is live on mwdebug1002, check please [19:26:38] will do [19:27:07] (03CR) 10jenkins-bot: Updates to enable transliteration for crhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396282 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones) [19:28:38] thcipriani: the transliteration menu showed up, but clicking it is giving errors. Test failed. :( [19:28:59] * subbu realizes what the 314159 in Trey314159 is about after seeing it so many times and not paying attention :) [19:29:10] (03PS5) 10Eevans: cassandra: create parent data directories with exec [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) [19:29:31] Trey314159: ok, should I go ahead and revert or is there any other testing you need to do? [19:30:03] thcipriani: can you give me a minute to look around and then revert? [19:30:37] yep, I'll make the revert but leave it up on mwdebug1002 for a few. [19:31:51] sorry, I'm here if needed [19:32:22] right on time, was just about to merge your patch :) [19:32:23] PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:35:42] (03PS1) 10Thcipriani: Revert "Updates to enable transliteration for crhwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405043 [19:37:45] (03CR) 10Eevans: cassandra: create parent data directories with exec (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) (owner: 10Eevans) [19:38:05] thcipriani: thanks. I'm not sure what happened, but I got some data for looking into it. 
[19:38:09] (03CR) 10Eevans: "Updated [PC output](http://puppet-compiler.wmflabs.org/9788/)" [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) (owner: 10Eevans) [19:38:18] Trey314159: sure thing, glad it was useful :) [19:38:29] (03CR) 10Thcipriani: [C: 032] Revert "Updates to enable transliteration for crhwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405043 (owner: 10Thcipriani) [19:40:37] (03Merged) 10jenkins-bot: Revert "Updates to enable transliteration for crhwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405043 (owner: 10Thcipriani) [19:40:47] (03CR) 10jenkins-bot: Revert "Updates to enable transliteration for crhwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405043 (owner: 10Thcipriani) [19:41:01] (03PS3) 10Dzahn: releases: Sync security patches for MW from deployment to nightlies server [puppet] - 10https://gerrit.wikimedia.org/r/404892 (owner: 10Chad) [19:43:20] edsanders: your change is live on mwdebug1002, check please [19:43:52] PROBLEM - configured eth on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:43:53] PROBLEM - DPKG on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:44:02] PROBLEM - dhclient process on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:44:03] PROBLEM - Check whether ferm is active by checking the default input chain on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:44:22] PROBLEM - Check size of conntrack table on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[19:44:52] RECOVERY - configured eth on pybal-test2001 is OK: OK - interfaces up [19:44:53] RECOVERY - DPKG on pybal-test2001 is OK: All packages OK [19:45:02] RECOVERY - dhclient process on pybal-test2001 is OK: PROCS OK: 0 processes with command name dhclient [19:45:03] RECOVERY - Check whether ferm is active by checking the default input chain on pybal-test2001 is OK: OK ferm input default policy is set [19:45:09] thcipriani: how do I do that :) [19:45:22] RECOVERY - Check size of conntrack table on pybal-test2001 is OK: OK: nf_conntrack is 0 % full [19:45:39] (03PS4) 10Dzahn: releases: Sync security patches for MW from deployment to nightlies server [puppet] - 10https://gerrit.wikimedia.org/r/404892 (owner: 10Chad) [19:45:46] * thcipriani digs for the documentation [19:46:16] edsanders: https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug [19:46:37] (03PS1) 10Tjones: Revert "Updates to enable short URLs for transliteration for crhwiki - beta" [puppet] - 10https://gerrit.wikimedia.org/r/405048 [19:47:19] (03PS1) 10Kaldari: Removing unused citizendium from $wgRelatedSitesPrefixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405049 (https://phabricator.wikimedia.org/T185246) [19:48:46] (03CR) 10Tjones: "The deployment of the transliteration in mediawiki-config did not work, so we should undo this config until we know what's going on." [puppet] - 10https://gerrit.wikimedia.org/r/405048 (owner: 10Tjones) [19:49:34] (03PS2) 10Kaldari: Removing unused citizendium from $wgRelatedSitesPrefixes... 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/405049 (https://phabricator.wikimedia.org/T185246) [19:50:05] thcipriani: looks good to me [19:50:27] okie doke, pushing out everywhere [19:51:20] (03CR) 10Dzahn: [C: 032] releases: Sync security patches for MW from deployment to nightlies server [puppet] - 10https://gerrit.wikimedia.org/r/404892 (owner: 10Chad) [19:53:23] !log thcipriani@tin Synchronized php-1.31.0-wmf.17/extensions/VisualEditor/modules/ve-mw/ui/pages/ve.ui.MWTemplatePlaceholderPage.js: SWAT: [[gerrit:405028|Update TitleInput getTitle to getMWTitle]] (duration: 01m 09s) [19:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:36] ^ edsanders live everywhere now, thanks! [19:53:56] thanks [19:54:19] !log arlolra@tin Started deploy [parsoid/deploy@8736b8c]: Updating Parsoid config [19:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:37] (03CR) 10Dzahn: "tin: created rsyncd config releases1001: created rsync cron to pull naos: nothing releases2001: nothing good!" [puppet] - 10https://gerrit.wikimedia.org/r/404892 (owner: 10Chad) [19:56:21] !log arlolra@tin Finished deploy [parsoid/deploy@8736b8c]: Updating Parsoid config (duration: 02m 01s) [19:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:45] (03CR) 10Dzahn: "follow-up needed, coming up in a minute" [puppet] - 10https://gerrit.wikimedia.org/r/404892 (owner: 10Chad) [19:59:15] !log arlolra@tin Started deploy [parsoid/deploy@8736b8c]: (no justification provided) [19:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:00] !log arlolra@tin Finished deploy [parsoid/deploy@8736b8c]: (no justification provided) (duration: 00m 44s) [20:00:04] thcipriani: How many deployers does it take to do MediaWiki train deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180118T2000). 
[20:00:05] No GERRIT patches in the queue for this window AFAICS. [20:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:12] !log arlolra@tin Started deploy [parsoid/deploy@8736b8c]: (no justification provided) [20:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:28] working on it, jouncebot. working on it... [20:01:21] !log arlolra@tin Finished deploy [parsoid/deploy@8736b8c]: (no justification provided) (duration: 01m 09s) [20:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:32] PROBLEM - SSH on ms-be2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:02:42] PROBLEM - MD RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:02:42] PROBLEM - dhclient process on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:04:22] PROBLEM - very high load average likely xfs on ms-be2023 is CRITICAL: CRITICAL - load average: 180.95, 108.02, 64.29 [20:04:42] PROBLEM - dhclient process on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[20:05:58] (03PS1) 10Dzahn: releases: ensure /srv/patches directory exists [puppet] - 10https://gerrit.wikimedia.org/r/405052 [20:06:22] RECOVERY - SSH on ms-be2023 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u1 (protocol 2.0) [20:06:33] RECOVERY - dhclient process on ms-be2023 is OK: PROCS OK: 0 processes with command name dhclient [20:06:33] RECOVERY - MD RAID on ms-be2023 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [20:07:00] (03CR) 10Dzahn: [C: 032] releases: ensure /srv/patches directory exists [puppet] - 10https://gerrit.wikimedia.org/r/405052 (owner: 10Dzahn) [20:07:11] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/405052/" [puppet] - 10https://gerrit.wikimedia.org/r/404892 (owner: 10Chad) [20:09:22] !log thcipriani@tin Synchronized php-1.31.0-wmf.17/extensions/Score/includes/Score.php: SWAT: [[gerrit:405029|Always pass FileBackend instance to `new FileRepo()`]] T185204 (duration: 01m 12s) [20:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:35] T185204: Score causes InvalidArgumentException from line 183 of /srv/mediawiki/php-1.31.0-wmf.17/includes/filebackend/FileBackendGroup.php: No backend defined with the name ``. - https://phabricator.wikimedia.org/T185204 [20:09:49] !log releases1001 - /srv/patches got created, initial manual rsync using /usr/local/sbin/sync-srv-patches created by rsync::quickdatacopy, mw patches exists on nightlies server now [20:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:23] RECOVERY - very high load average likely xfs on ms-be2023 is OK: OK - load average: 32.84, 79.74, 70.31 [20:17:24] Hi all! [20:18:03] If I want to tunnel in to prometheus1001 to use the prometheus web ui directly, which credentials should I use? [20:18:26] Or how should I request access? 
[20:20:16] paladox: Hmmm https://phabricator.wikimedia.org/P6616 [20:20:26] I wonder if my postdata is busted [20:20:31] requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://gerrit.wikimedia.org/r/a/projects/mediawiki%2Fcore/branches/test [20:20:44] that branch exists [20:20:46] so hmm [20:21:30] Duhhhhhhhhh [20:21:38] I'm a dumbass [20:21:38] no_justification i wonder are you trying to fetch the branch or create it? [20:21:46] Data should just be a dict. [20:21:50] Not fake JSON [20:21:51] oh [20:21:54] (03PS1) 10Thcipriani: All wikis to 1.31.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405055 [20:21:57] * no_justification smacks self [20:22:39] (Note that /projects/ is being renamed to /repo/ :)) [20:22:41] "scappish" :) [20:23:43] (03CR) 10Andrew Bogott: rabbitmq: handling users and initial setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/403202 (owner: 10Rush) [20:23:52] PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:23:55] thcipriani: Where I put scap-related things! [20:24:00] Scap-ish things! [20:24:01] :p [20:24:04] makes sense [20:24:09] :) [20:24:31] https://gerrit.wikimedia.org/r/405056 should fix things [20:24:35] (03CR) 10Thcipriani: [C: 032] All wikis to 1.31.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405055 (owner: 10Thcipriani) [20:24:53] PROBLEM - Disk space on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:25:03] PROBLEM - swift-container-auditor on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:25:03] PROBLEM - swift-object-auditor on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:25:32] PROBLEM - swift-account-replicator on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
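Note on the 20:20 to 20:21 exchange above ("Data should just be a dict. Not fake JSON"): a minimal sketch of that fix, assuming the intent was to create the branch through Gerrit's "Create Branch" REST endpoint (PUT on /projects/{project}/branches/{branch} with a "revision" field). The log shows the `requests` library in use; this sketch uses only the Python standard library so that it runs without third-party packages and sends nothing over the network. The function names, the `master` revision, and the single-quoted "fake JSON" string are hypothetical illustrations, not taken from the log.

```python
import json
import urllib.parse
import urllib.request

# Base URL as it appears in the log; the /a/ prefix means the real call
# needs authentication, which is deliberately left out of this sketch.
GERRIT = "https://gerrit.wikimedia.org/r/a"


def branch_url(project, branch):
    """Build the branch endpoint, percent-encoding '/' in the project name."""
    return "{}/projects/{}/branches/{}".format(
        GERRIT,
        urllib.parse.quote(project, safe=""),
        urllib.parse.quote(branch, safe=""),
    )


def build_create_branch_request(project, branch, revision):
    """Prepare (but do not send) a PUT request with a properly encoded body."""
    payload = {"revision": revision}            # a plain dict ...
    body = json.dumps(payload).encode("utf-8")  # ... serialized by the json module
    return urllib.request.Request(
        branch_url(project, branch),
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json; charset=UTF-8"},
    )


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except ValueError:
        return False
    return True


# Hypothetical values for illustration only.
req = build_create_branch_request("mediawiki/core", "test", "master")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# A hand-written, Python-repr-style string is one (assumed) form of "fake JSON".
print(is_valid_json("{'revision': 'master'}"))
```

Building the body from a dict and letting the serializer produce the JSON avoids quoting mistakes that a server may reject with 400 Bad Request. With `requests`, the equivalent is passing the dict through the `json=` keyword argument.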
[20:25:32] PROBLEM - swift-container-server on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:25:42] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:25:42] PROBLEM - configured eth on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:25:52] PROBLEM - swift-object-updater on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:25:52] PROBLEM - dhclient process on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:25:52] PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:26:12] (03Merged) 10jenkins-bot: All wikis to 1.31.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405055 (owner: 10Thcipriani) [20:26:12] PROBLEM - swift-account-reaper on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:26:33] PROBLEM - swift-account-auditor on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:26:33] PROBLEM - Check size of conntrack table on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:27:11] (03CR) 10jenkins-bot: All wikis to 1.31.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405055 (owner: 10Thcipriani) [20:27:53] PROBLEM - puppet last run on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:27:53] PROBLEM - swift-container-updater on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:28:12] PROBLEM - swift-object-server on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:28:12] PROBLEM - DPKG on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:28:21] ms-be2023 seems unhappy [20:28:39] godog ^^. 
[20:28:52] PROBLEM - dhclient process on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:29:02] PROBLEM - Disk space on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:29:43] PROBLEM - swift-container-replicator on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:29:43] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:29:52] PROBLEM - swift-object-updater on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:29:52] PROBLEM - MD RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:30:12] PROBLEM - swift-object-server on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:30:12] PROBLEM - DPKG on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:30:42] PROBLEM - swift-account-replicator on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:30:54] PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:31:11] it will recover soon [20:31:17] !log thcipriani@tin rebuilt and synchronized wikiversions files: All wikis to 1.31.0-wmf.17 [20:31:19] it's syncing [20:31:25] and really busy, but not dead [20:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:53] PROBLEM - MD RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[20:32:32] RECOVERY - swift-container-replicator on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [20:32:32] RECOVERY - swift-account-replicator on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [20:32:32] RECOVERY - swift-container-server on ms-be2023 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [20:32:33] RECOVERY - Check size of conntrack table on ms-be2023 is OK: OK: nf_conntrack is 4 % full [20:32:33] RECOVERY - swift-account-auditor on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [20:32:33] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2023 is OK: OK ferm input default policy is set [20:32:33] RECOVERY - configured eth on ms-be2023 is OK: OK - interfaces up [20:32:43] RECOVERY - puppet last run on ms-be2023 is OK: OK: Puppet is currently enabled, last run 26 minutes ago with 0 failures [20:32:43] PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:32:43] RECOVERY - dhclient process on ms-be2023 is OK: PROCS OK: 0 processes with command name dhclient [20:32:43] RECOVERY - swift-object-updater on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [20:32:43] RECOVERY - MD RAID on ms-be2023 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [20:32:52] RECOVERY - Disk space on ms-be2023 is OK: DISK OK [20:32:53] RECOVERY - swift-container-updater on ms-be2023 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [20:33:02] RECOVERY - swift-account-reaper on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [20:33:02] RECOVERY - DPKG on ms-be2023 is OK: All packages OK [20:33:02] RECOVERY - swift-container-auditor on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:33:02] RECOVERY - swift-object-server on ms-be2023 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [20:33:02] RECOVERY - swift-object-auditor on ms-be2023 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [20:33:22] (03PS1) 10Phantom42: mediawiki: Better error page layout on mobile devices [puppet] - 10https://gerrit.wikimedia.org/r/405058 (https://phabricator.wikimedia.org/T182247) [20:36:23] PROBLEM - puppet last run on graphite1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:40:22] PROBLEM - carbon-cache@h service on graphite1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[20:41:22] RECOVERY - carbon-cache@h service on graphite1003 is OK: OK - carbon-cache@h is active [20:41:33] (03PS2) 10Awight: [DO NOT MERGE] Update ORES venv path to use versioned cache [puppet] - 10https://gerrit.wikimedia.org/r/392683 (https://phabricator.wikimedia.org/T181071) [20:43:49] (03CR) 10Gehel: "@Tjones: could you add the reason for the revert in the commit message? Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/405048 (owner: 10Tjones) [20:53:00] 10Operations, 10hardware-requests, 10Patch-For-Review: Decommission host erbium - https://phabricator.wikimedia.org/T185226#3911110 (10Peachey88) [20:53:05] 10Operations, 10Ops-Access-Requests: Requesting access to stat1004, stat1005, stat1006 for mneisler - https://phabricator.wikimedia.org/T184838#3911111 (10RobH) Please note we're still awaiting your managers sign off on this task. Once that is in, we should be able to process this. Thanks! [20:53:40] !log arlolra@tin Started deploy [parsoid/deploy@a95fede]: Update Parsoid config, again [20:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:19] !log arlolra@tin Finished deploy [parsoid/deploy@a95fede]: Update Parsoid config, again (duration: 09m 39s) [21:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:23] RECOVERY - puppet last run on graphite1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:22:04] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns entries for decom host erbium [dns] - 10https://gerrit.wikimedia.org/r/405012 (https://phabricator.wikimedia.org/T185226) (owner: 10Cmjohnson) [21:28:52] (03PS3) 10Chad: Gerrit 2.14.6 [software/gerrit] - 10https://gerrit.wikimedia.org/r/395820 (https://phabricator.wikimedia.org/T156120) [21:31:28] 10Operations: hardware request for tin replacement - https://phabricator.wikimedia.org/T184481#3911209 (10Dzahn) [21:33:59] 10Operations: hardware request for tin replacement - 
https://phabricator.wikimedia.org/T184481#3911213 (10Dzahn) [21:34:33] 10Operations: hardware request for tin replacement - https://phabricator.wikimedia.org/T184481#3884434 (10Dzahn) a:05Dzahn>03None [21:34:47] 10Operations, 10hardware-requests: hardware request for tin replacement - https://phabricator.wikimedia.org/T184481#3884434 (10Dzahn) [21:35:25] (03PS24) 10Paladox: gerrit: Ajust scap files (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 [21:36:33] (03PS2) 10Tjones: Revert "Updates to enable short URLs for transliteration for crhwiki - beta" [puppet] - 10https://gerrit.wikimedia.org/r/405048 (https://phabricator.wikimedia.org/T23582) [21:36:57] (03CR) 10jerkins-bot: [V: 04-1] Revert "Updates to enable short URLs for transliteration for crhwiki - beta" [puppet] - 10https://gerrit.wikimedia.org/r/405048 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones) [21:40:01] (03CR) 10Paladox: [C: 031] Gerrit 2.14.6 [software/gerrit] - 10https://gerrit.wikimedia.org/r/395820 (https://phabricator.wikimedia.org/T156120) (owner: 10Chad) [21:44:52] (03PS3) 10Tjones: Revert "Updates to enable short URLs for transliteration for crhwiki - beta" [puppet] - 10https://gerrit.wikimedia.org/r/405048 (https://phabricator.wikimedia.org/T23582) [21:47:37] (03CR) 10Rush: rabbitmq: handling users and initial setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/403202 (owner: 10Rush) [21:48:14] (03PS4) 10Tjones: Revert "Updates to enable short URLs for transliteration for crhwiki - beta" [puppet] - 10https://gerrit.wikimedia.org/r/405048 (https://phabricator.wikimedia.org/T23582) [21:50:26] (03CR) 10Andrew Bogott: [C: 031] "I'm convinced" [puppet] - 10https://gerrit.wikimedia.org/r/403202 (owner: 10Rush) [22:04:47] (03PS1) 10Andrew Bogott: role::labs::mediawiki_vagrant: Warn if not on Jessie [puppet] - 10https://gerrit.wikimedia.org/r/405203 [22:13:54] (03PS1) 10Dduvall: Add service-checker image used to test service images 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/405205 [22:15:27] (03PS1) 10EBernhardson: Switch wiktionary sister search on enwiki to title only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405206 (https://phabricator.wikimedia.org/T185250) [22:15:33] (03PS2) 10BryanDavis: role::labs::mediawiki_vagrant: Warn if not on Jessie [puppet] - 10https://gerrit.wikimedia.org/r/405203 (https://phabricator.wikimedia.org/T180377) (owner: 10Andrew Bogott) [22:16:04] Uh crap. I think I broke archiva. It's returning 502 bad gateway [22:16:12] (03CR) 10BryanDavis: [C: 031] "Added references to the bug report so we remember to clean this up when I finally get around to fixing the problem." [puppet] - 10https://gerrit.wikimedia.org/r/405203 (https://phabricator.wikimedia.org/T180377) (owner: 10Andrew Bogott) [22:16:40] (03CR) 10Paladox: "I found the error see https://phabricator.wikimedia.org/T180377#3911321 maybe easier to fix it in lxc." [puppet] - 10https://gerrit.wikimedia.org/r/405203 (https://phabricator.wikimedia.org/T180377) (owner: 10Andrew Bogott) [22:16:43] (03CR) 10jerkins-bot: [V: 04-1] Switch wiktionary sister search on enwiki to title only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405206 (https://phabricator.wikimedia.org/T185250) (owner: 10EBernhardson) [22:16:53] (03PS2) 10Dduvall: Add service-checker image used to test service images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/405205 (https://phabricator.wikimedia.org/T184220) [22:17:06] no_justification works for me [22:17:11] Yeah nvm it's back [22:20:18] (03PS4) 10Dduvall: Include scaffold for service-checker helm tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/405016 [22:22:48] (03CR) 10Dduvall: Add service-checker image used to test service images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/405205 (https://phabricator.wikimedia.org/T184220) (owner: 10Dduvall) [22:23:24] (03Draft1) 10Paladox: 
lxc: Fix support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/405208 [22:23:26] (03PS2) 10Paladox: lxc: Fix support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/405208 [22:23:43] (03PS2) 10EBernhardson: Switch wiktionary sister search on enwiki to title only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405206 (https://phabricator.wikimedia.org/T185250) [22:26:47] (03PS3) 10Paladox: lxc: Fix support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/405208 [22:34:06] (03PS4) 10Paladox: lxc: Fix support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) [22:34:40] (03CR) 10Paladox: "This allows puppet to run locally for me on a stretch instance." [puppet] - 10https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) (owner: 10Paladox) [22:34:49] 10Operations, 10Puppet: Trusty puppet 4 approach - https://phabricator.wikimedia.org/T182894#3911375 (10herron) This week @chasemp discovered an issue where `puppet apply` breaks on trusty hosts with `Error: Evaluation Error: Error while evaluating a Function Call, uninitialized constant RGen::ECore::ELong`.... [22:34:52] (03CR) 10BryanDavis: lxc: Fix support for stretch (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) (owner: 10Paladox) [22:35:23] (03CR) 10BryanDavis: "> This allows puppet to run locally for me on a stretch instance." 
[puppet] - 10https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) (owner: 10Paladox) [22:36:50] !log added ruby-rgen-0.7.0-1 (backported package from jessie) to trusty-wikimedia apt repo (T182894) [22:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:04] T182894: Trusty puppet 4 approach - https://phabricator.wikimedia.org/T182894 [22:37:25] (03PS5) 10Paladox: lxc: Fix support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) [22:37:36] (03CR) 10Paladox: lxc: Fix support for stretch (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) (owner: 10Paladox) [22:38:31] (03CR) 10Faidon Liambotis: lxc: Fix support for stretch (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) (owner: 10Paladox) [22:39:28] ah, there's a new PS [22:40:50] (03CR) 10Faidon Liambotis: [C: 04-1] lxc: Fix support for stretch (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) (owner: 10Paladox) [22:50:23] 10Operations, 10MediaWiki-Export-or-Import, 10Wikimedia-General-or-Unknown: Special:Import error: "Import failed: Could not open import file" - https://phabricator.wikimedia.org/T17000#3911411 (10TTO) [22:52:42] (03PS6) 10Paladox: lxc: Fix support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) [22:54:27] (03CR) 10Paladox: lxc: Fix support for stretch (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) (owner: 10Paladox) [22:56:09] (03CR) 10BryanDavis: lxc: Fix support for stretch (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) (owner: 10Paladox) [23:01:17] (03PS7) 10Paladox: lxc: Fix support for stretch [puppet] - 
10https://gerrit.wikimedia.org/r/405208 (https://phabricator.wikimedia.org/T180377) [23:11:52] !log bootstrapping restbase1015-b -- T184100 [23:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:05] T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100 [23:30:35] 10Operations: hardware request for bast1001 replacement - https://phabricator.wikimedia.org/T184480#3911551 (10Dzahn) [23:31:04] 10Operations, 10hardware-requests: hardware request for bast1001 replacement - https://phabricator.wikimedia.org/T184480#3884422 (10Dzahn) [23:33:06] 10Operations, 10hardware-requests: hardware request for bast1001 replacement - https://phabricator.wikimedia.org/T184480#3911556 (10Dzahn) [23:33:20] 10Operations, 10hardware-requests: hardware request for bast1001 replacement - https://phabricator.wikimedia.org/T184480#3884422 (10Dzahn) a:05Dzahn>03None [23:38:31] 10Operations: replace tin (new hardware) - https://phabricator.wikimedia.org/T185275#3911588 (10Dzahn) [23:42:53] RECOVERY - High CPU load on API appserver on mw1230 is OK: OK - load average: 10.00, 18.13, 23.44