[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160609T0000). [00:02:27] dear jouncebot: yes, master [00:03:46] !log Preparing to deploy phabricator update. Tagged release/2016-06-08/1 [00:03:53] (03CR) 10Dzahn: [C: 032] "yep, confirmed, the old ones are gone" [dns] - 10https://gerrit.wikimedia.org/r/293446 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [00:05:38] wtf logmsgbot [00:06:53] !log meta T46791 [00:06:53] T46791: [[wikitech:Server_admin_log]] should not rely on freenode irc for logmsgbot entries - https://phabricator.wikimedia.org/T46791 [00:09:07] stashbot: tell wikibugs that logmsgbot is gone [00:11:44] !log restarted the log bot [00:11:59] what.. [00:12:53] oh, wrong bot , duh [00:14:24] taking phabricator offline for a moment [00:14:37] (scheduled, icinga already silenced) [00:15:19] !log restarted the log bot [00:15:30] gimme a break [00:16:43] joins #morebots-test :p [00:17:16] well that doesnt exist anymore like the docs say [00:19:37] 06Operations, 10ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2366623 (10Papaul) [00:19:52] all copies of morebots stopped logging [00:20:00] not just the one here in production channel [00:20:13] i think toollabs issue then [00:20:36] more in -labs [00:20:57] 06Operations, 10ops-codfw: rack/setup/deploy mw22[1-5][0-9] switch configuration - https://phabricator.wikimedia.org/T136670#2366624 (10Papaul) rectification not B4 but A4 [00:26:57] 06Operations, 06Performance-Team, 06Services, 07Availability: Create restbase BagOStuff subclass (session storage) - https://phabricator.wikimedia.org/T137272#2366638 (10Smalyshev) I checked the session size on my local vagrant install, it's 780 bytes, so not too big. Of course, productions sessions may be... [00:28:32] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [00:32:20] woah, phabricator is now sending html email? [00:33:16] legoktm yep [00:33:29] was changed to default [00:33:32] is it possible that this broke morebots [00:33:35] [15:44:06] !log tgr@tin Synchronized wmf-config/InitialiseSettings.php: enable AuthManager on group1 for reals T135504 (duration: 00m 25s) [00:33:36] T135504: Enable AuthManager in WMF production - https://phabricator.wikimedia.org/T135504 [00:34:07] html email? heh, glad i changed it all to browser notifications [00:34:09] mutante maybe but we can set it back to plain email for that bot. [00:34:49] legoktm download links too https://phabricator.wikimedia.org/diffusion/MW/ [00:35:32] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:35:33] plain email? [00:35:54] mutante yes you can change it in settings. [00:36:14] Maybe twentyafterfour would know [00:36:40] I don't think that broke the bots because the bots broke shortly before the update, not after [00:36:56] none of the bots AFAIK parse phab email [00:37:04] the log message abotu the AuthManager being enabled [00:37:11] is the last that worked [00:37:27] phab mail is unrelated [00:40:28] does morebots keep a debug log somewhere? 
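The "Tagged release/2016-06-08/1" prep step above is presumably an ordinary git tag pushed to the Phabricator deployment repository so the update can be tracked and rolled back; a minimal sketch, with the repository path and remote name assumed for illustration only:

    # Tag the tree being deployed and publish the tag (the path and remote name
    # here are assumptions, not the actual deployment layout).
    cd /srv/deployment/phabricator/deployment
    git tag -a release/2016-06-08/1 -m "Phabricator update for the 2016-06-09 00:00 UTC window"
    git push origin release/2016-06-08/1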
[00:40:45] the docs said there is a testbot in #morebots-test but the channel is empty [00:41:21] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.042 second response time [00:56:19] egoktm hi could you update the site notice on this channel please [00:56:31] legoktm ^^ [00:56:37] Since it still says Status: CI down [00:58:02] Thanks [01:14:54] (03PS3) 10Dzahn: Retry Wikidata dump creation up to three times [puppet] - 10https://gerrit.wikimedia.org/r/293445 (owner: 10Hoo man) [01:15:09] (03CR) 10Dzahn: [C: 032] Retry Wikidata dump creation up to three times [puppet] - 10https://gerrit.wikimedia.org/r/293445 (owner: 10Hoo man) [01:17:52] 06Operations, 07Graphite: "carbon-cache too many creates" on graphite1001 - https://phabricator.wikimedia.org/T137380#2366699 (10Dzahn) [01:18:16] 06Operations, 10Monitoring, 07Graphite: "carbon-cache too many creates" on graphite1001 - https://phabricator.wikimedia.org/T137380#2366712 (10Dzahn) [01:25:22] legoktm: I've been getting HTML e-mail from Phabricator Phabricator for a while now. [01:25:38] And Evan just did better "edited task description" e-mails with diffs! [01:27:20] 06Operations, 10ops-eqiad: mw1063 broken - https://phabricator.wikimedia.org/T137381#2366726 (10Dzahn) [01:27:28] Oooh :D [01:27:48] https://secure.phabricator.com/T7643 has screenshots. [01:30:10] ACKNOWLEDGEMENT - Host mw1063 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T137381 [01:30:29] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [01:31:08] PROBLEM - Disk space on ms-be2012 is CRITICAL: DISK CRITICAL - free space: / 2099 MB (3% inode=96%): /srv/swift-storage/sdl1 112647 MB (5% inode=91%) [02:26:33] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.4) (duration: 11m 00s) [02:27:39] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:29] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.019 second response time [02:48:37] (03PS1) 10Microchip08: Redirect phabricator.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/293464 (https://phabricator.wikimedia.org/T137252) [02:49:01] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.5) (duration: 11m 02s) [02:52:10] 06Operations, 10DNS, 10Phabricator, 10Traffic, 13Patch-For-Review: Redirect phabricator.mediawiki.org to phabricator.wikimedia.org - https://phabricator.wikimedia.org/T137252#2366860 (10MC8) It looks like redirects existed for Bugzilla, so I guess you could call this a regression. 
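The "Retry Wikidata dump creation up to three times" change above presumably wraps the dump command in a retry loop; a minimal sketch of that pattern, where the retry count comes from the commit title but the script name and sleep interval are assumptions:

    # Re-run the dump up to three times before giving up.
    ok=0
    for try in 1 2 3; do
        if /usr/local/bin/dumpwikidatajson.sh; then   # script name is an assumption
            ok=1
            break
        fi
        echo "wikidata dump attempt ${try} failed, retrying" >&2
        sleep 60
    done
    [ "$ok" -eq 1 ] || exit 1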
[02:53:56] !log ms-be2012 ran out of disk due to huge syslog, deleted log, restarted rsyslogd [02:55:20] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Jun 9 02:55:20 UTC 2016 (duration 6m 19s) [03:01:19] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:03:09] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.021 second response time [03:17:08] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:18:58] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.025 second response time [03:22:26] 06Operations, 10DNS, 10Phabricator, 10Traffic, 13Patch-For-Review: Redirect phabricator.mediawiki.org to phabricator.wikimedia.org - https://phabricator.wikimedia.org/T137252#2366879 (10MZMcBride) >>! In T137252#2362373, @Krenair wrote: > Isn't commons.wikipedia.org just a historical thing? Yeah. >>! I... [03:37:51] 06Operations, 10Deployment-Systems, 13Patch-For-Review, 03Scap3: Warning: rename(): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 189 - https://phabricator.wikimedia.org/T136258#2366885 (10Dereckson) Do a manual run of l10nupdate on Tin to check if all is now fine perhaps? [04:00:57] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:05:21] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.225 second response time [04:06:30] PROBLEM - MD RAID on mw2216 is CRITICAL: Timeout while attempting connection [04:06:30] PROBLEM - MD RAID on mw2215 is CRITICAL: Timeout while attempting connection [04:07:21] PROBLEM - Apache HTTP on mw2215 is CRITICAL: Connection timed out [04:07:30] PROBLEM - configured eth on mw2216 is CRITICAL: Timeout while attempting connection [04:07:30] PROBLEM - configured eth on mw2215 is CRITICAL: Timeout while attempting connection [04:07:40] PROBLEM - dhclient process on mw2216 is CRITICAL: Timeout while attempting connection [04:07:40] PROBLEM - dhclient process on mw2215 is CRITICAL: Timeout while attempting connection [04:07:41] PROBLEM - mediawiki-installation DSH group on mw2215 is CRITICAL: Host mw2215 is not in mediawiki-installation dsh group [04:07:41] PROBLEM - mediawiki-installation DSH group on mw2216 is CRITICAL: Host mw2216 is not in mediawiki-installation dsh group [04:08:10] PROBLEM - nutcracker port on mw2216 is CRITICAL: Timeout while attempting connection [04:08:11] PROBLEM - nutcracker port on mw2215 is CRITICAL: Timeout while attempting connection [04:08:30] PROBLEM - nutcracker process on mw2216 is CRITICAL: Timeout while attempting connection [04:08:30] PROBLEM - nutcracker process on mw2215 is CRITICAL: Timeout while attempting connection [04:08:42] PROBLEM - puppet last run on mw2216 is CRITICAL: Timeout while attempting connection [04:08:50] PROBLEM - puppet last run on mw2215 is CRITICAL: Timeout while attempting connection [04:09:00] PROBLEM - salt-minion processes on mw2215 is CRITICAL: Timeout while attempting connection [04:09:00] PROBLEM - salt-minion processes on mw2216 is CRITICAL: Timeout while attempting connection [04:09:20] PROBLEM - Apache HTTP on mw2216 is CRITICAL: Connection timed out [04:09:31] PROBLEM - Check size of conntrack table on mw2216 is CRITICAL: Timeout while attempting connection [04:09:31] PROBLEM - Check size of conntrack table on mw2215 is CRITICAL: Timeout while 
attempting connection [04:09:50] PROBLEM - DPKG on mw2215 is CRITICAL: Timeout while attempting connection [04:09:50] PROBLEM - DPKG on mw2216 is CRITICAL: Timeout while attempting connection [04:10:10] PROBLEM - Disk space on mw2215 is CRITICAL: Timeout while attempting connection [04:10:10] PROBLEM - Disk space on mw2216 is CRITICAL: Timeout while attempting connection [04:22:39] $ [04:23:18] seems to be the monitoring as every probe fails ^ [04:28:42] RECOVERY - Apache HTTP on mw2215 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.074 second response time [04:31:01] RECOVERY - configured eth on mw2215 is OK: OK - interfaces up [04:31:02] RECOVERY - Check size of conntrack table on mw2215 is OK: OK: nf_conntrack is 0 % full [04:31:11] RECOVERY - dhclient process on mw2215 is OK: PROCS OK: 0 processes with command name dhclient [04:31:41] RECOVERY - Disk space on mw2215 is OK: DISK OK [04:31:51] RECOVERY - nutcracker port on mw2215 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [04:32:00] RECOVERY - MD RAID on mw2215 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [04:32:10] RECOVERY - nutcracker process on mw2215 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [04:32:29] RECOVERY - DPKG on mw2215 is OK: All packages OK [04:33:09] RECOVERY - salt-minion processes on mw2215 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:33:20] RECOVERY - Apache HTTP on mw2216 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.074 second response time [04:34:28] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 288 bytes in 0.038 second response time [04:35:28] RECOVERY - DPKG on mw2216 is OK: All packages OK [04:35:58] PROBLEM - Disk space on mw2219 is CRITICAL: Timeout while attempting connection [04:35:58] RECOVERY - Disk space on mw2216 is OK: DISK OK [04:35:59] RECOVERY - configured eth on mw2216 is OK: OK - interfaces up [04:36:18] RECOVERY - nutcracker port on mw2216 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [04:36:19] PROBLEM - MD RAID on mw2219 is CRITICAL: Timeout while attempting connection [04:36:19] RECOVERY - dhclient process on mw2216 is OK: PROCS OK: 0 processes with command name dhclient [04:36:38] RECOVERY - MD RAID on mw2216 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [04:36:39] RECOVERY - nutcracker process on mw2216 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [04:36:49] RECOVERY - salt-minion processes on mw2216 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:36:59] RECOVERY - Check size of conntrack table on mw2216 is OK: OK: nf_conntrack is 0 % full [04:37:09] PROBLEM - configured eth on mw2219 is CRITICAL: Timeout while attempting connection [04:37:18] PROBLEM - Apache HTTP on mw2219 is CRITICAL: Connection timed out [04:37:19] PROBLEM - dhclient process on mw2219 is CRITICAL: Timeout while attempting connection [04:37:39] PROBLEM - mediawiki-installation DSH group on mw2219 is CRITICAL: Host mw2219 is not in mediawiki-installation dsh group [04:38:08] PROBLEM - nutcracker port on mw2219 is CRITICAL: Timeout while attempting connection [04:38:19] PROBLEM - nutcracker process on mw2219 is CRITICAL: Timeout while attempting connection [04:38:38] PROBLEM - puppet last run on mw2219 is CRITICAL: Timeout while attempting connection 
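The mw2215/mw2216/mw2219 probe failures above are expected while the new codfw app servers (T135466) are still being provisioned, and they get acknowledged in Icinga a couple of hours later (see the ACKNOWLEDGEMENT entries further down). Doing that by hand goes through Icinga's external command file; a sketch, with the command-file path an assumption for this installation:

    # Acknowledge a known service problem so it stops re-notifying until recovery.
    # ACKNOWLEDGE_SVC_PROBLEM;<host>;<service>;<sticky>;<notify>;<persistent>;<author>;<comment>
    # is standard Nagios/Icinga external-command syntax; the command file path is assumed.
    now=$(date +%s)
    printf '[%d] ACKNOWLEDGE_SVC_PROBLEM;mw2215;Apache HTTP;1;0;1;daniel_zahn;new https://phabricator.wikimedia.org/T135466\n' \
        "$now" > /var/lib/icinga/rw/icinga.cmd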
[04:38:58] PROBLEM - salt-minion processes on mw2219 is CRITICAL: Timeout while attempting connection [04:39:28] PROBLEM - Check size of conntrack table on mw2219 is CRITICAL: Timeout while attempting connection [04:39:40] PROBLEM - DPKG on mw2219 is CRITICAL: Timeout while attempting connection [04:42:58] PROBLEM - NTP on mw2215 is CRITICAL: NTP CRITICAL: Offset unknown [04:46:49] RECOVERY - NTP on mw2215 is OK: NTP OK: Offset -0.009445309639 secs [04:51:48] PROBLEM - puppet last run on mw2215 is CRITICAL: CRITICAL: Puppet has 2 failures [04:53:59] PROBLEM - Apache HTTP on mw2215 is CRITICAL: Connection refused [04:56:00] PROBLEM - puppet last run on mw2216 is CRITICAL: CRITICAL: Puppet has 2 failures [04:58:19] PROBLEM - Apache HTTP on mw2216 is CRITICAL: Connection refused [05:02:49] RECOVERY - Apache HTTP on mw2219 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.079 second response time [05:05:49] RECOVERY - MD RAID on mw2219 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [05:05:59] RECOVERY - nutcracker process on mw2219 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [05:06:29] RECOVERY - salt-minion processes on mw2219 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [05:06:40] RECOVERY - configured eth on mw2219 is OK: OK - interfaces up [05:06:59] RECOVERY - dhclient process on mw2219 is OK: PROCS OK: 0 processes with command name dhclient [05:06:59] RECOVERY - Check size of conntrack table on mw2219 is OK: OK: nf_conntrack is 0 % full [05:07:28] RECOVERY - Disk space on mw2219 is OK: DISK OK [05:07:38] RECOVERY - nutcracker port on mw2219 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [05:08:09] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:09:19] RECOVERY - DPKG on mw2219 is OK: All packages OK [05:10:09] PROBLEM - NTP on mw2219 is CRITICAL: NTP CRITICAL: Offset unknown [05:11:50] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.013 second response time [05:19:59] RECOVERY - NTP on mw2219 is OK: NTP OK: Offset 0.00165784359 secs [05:27:49] PROBLEM - puppet last run on mw2219 is CRITICAL: CRITICAL: Puppet has 2 failures [05:28:19] PROBLEM - Apache HTTP on mw2219 is CRITICAL: Connection refused [05:38:30] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.817 second response time [05:42:44] yuvipanda: ^ [05:50:28] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 15.795 second response time [05:53:28] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.006 second response time [05:59:29] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 8.983 second response time [06:01:30] 06Operations: ms-be2012 ran out of disk space - https://phabricator.wikimedia.org/T137397#2367102 (10Dzahn) [06:02:06] 06Operations: ms-be2012 ran out of disk space - https://phabricator.wikimedia.org/T137397#2367114 (10Dzahn) a:03faidon [06:06:41] 06Operations: ms-be2012 ran out of disk space - https://phabricator.wikimedia.org/T137397#2367116 (10Dzahn) meanwhile it's already warning again https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ms-be2012&service=Disk+space and about 8G .. 
there is a lot of activity there , PUTs and DELETEs... [06:10:49] ACKNOWLEDGEMENT - Apache HTTP on mw2215 is CRITICAL: Connection refused daniel_zahn new https://phabricator.wikimedia.org/T135466 [06:10:49] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2215 is CRITICAL: Host mw2215 is not in mediawiki-installation dsh group daniel_zahn new https://phabricator.wikimedia.org/T135466 [06:10:49] ACKNOWLEDGEMENT - puppet last run on mw2215 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn new https://phabricator.wikimedia.org/T135466 [06:10:49] ACKNOWLEDGEMENT - Apache HTTP on mw2216 is CRITICAL: Connection refused daniel_zahn new https://phabricator.wikimedia.org/T135466 [06:10:49] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2216 is CRITICAL: Host mw2216 is not in mediawiki-installation dsh group daniel_zahn new https://phabricator.wikimedia.org/T135466 [06:10:49] ACKNOWLEDGEMENT - puppet last run on mw2216 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn new https://phabricator.wikimedia.org/T135466 [06:10:50] ACKNOWLEDGEMENT - Apache HTTP on mw2219 is CRITICAL: Connection refused daniel_zahn new https://phabricator.wikimedia.org/T135466 [06:10:50] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2219 is CRITICAL: Host mw2219 is not in mediawiki-installation dsh group daniel_zahn new https://phabricator.wikimedia.org/T135466 [06:10:51] ACKNOWLEDGEMENT - puppet last run on mw2219 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn new https://phabricator.wikimedia.org/T135466 [06:13:10] 06Operations, 10ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2299896 (10Dzahn) These have been added to DNS now , mgmt and prod IPs, changes by papaul, i reviewed and merged. papaul started installing servers ..mw2215 thru mw2219 already showing up in Icinga... [06:14:45] 06Operations, 10ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2367125 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/293446/ https://gerrit.wikimedia.org/r/#/c/292307/ https://gerrit.wikimedia.org/r/#/c/293246/ https://gerrit.wikimedia.org/r/#/c/293218/ https:/... 
[06:20:35] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [06:26:34] RECOVERY - Disk space on ms-be2012 is OK: DISK OK [06:30:45] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:16] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:49] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:57] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:47] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:17] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Puppet has 2 failures [06:35:37] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:18] PROBLEM - Disk space on logstash1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=97%) [06:36:48] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:44:25] (03PS3) 10Muehlenhoff: Provide the firejail containment for imagemagick's convert(1) on all app servers [puppet] - 10https://gerrit.wikimedia.org/r/293328 (https://phabricator.wikimedia.org/T135111) [06:54:48] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:55:28] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:55:37] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:56:37] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:56:38] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:58] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:18] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:38] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:27] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:48] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:10:21] PROBLEM - Disk space on lithium is CRITICAL: DISK CRITICAL - free space: /srv/syslog 10976 MB (3% inode=99%) [07:11:18] what is lithium? [07:13:02] log aggregation via rsyslog, little used AFAICT [07:14:11] RECOVERY - Disk space on lithium is OK: DISK OK [07:14:24] that check seems bogus, there's 73 GB free? 
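On the lithium question just above: the disk space checks alert on percent free via check_disk, so a large /srv/syslog volume can dip under a percentage threshold (and recover minutes later, as it did here) while still holding a sizeable absolute amount of space. A sketch of how such a check reads, with the thresholds assumed to match the 3% figure in the alert:

    # check_disk compares percent free, not absolute bytes: with -c 3% the check
    # below goes CRITICAL even though ~11 GB is still free on a large volume.
    /usr/lib/nagios/plugins/check_disk -w 6% -c 3% -p /srv/syslog
    # DISK CRITICAL - free space: /srv/syslog 10976 MB (3% inode=99%)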
[07:14:30] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:16:30] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 8.357 second response time [07:17:10] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:18:52] RECOVERY - puppet last run on snapshot1002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [07:22:56] !log removed /var/log/logstash/logstash.1 on logstash1001, logspam (similar to the what is described in https://github.com/logstash-plugins/logstash-output-elasticsearch/issues/144) depleted the space on the root partition [07:23:30] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:24:11] RECOVERY - Disk space on logstash1001 is OK: DISK OK [07:43:40] (03CR) 10Elukey: "Alex point is that configuring specific monitors in a puppet module, rather than in a role, brings us to issues like the one we are discus" [puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) (owner: 10Elukey) [07:51:19] (03Draft2) 10Gehel: Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 [07:51:38] (03CR) 10Gehel: Script to do the initial data load from OSM for Maps project (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/293105 (owner: 10Gehel) [08:04:49] (03PS1) 10Gehel: explicitely set input reader format in osm2pgsql [puppet] - 10https://gerrit.wikimedia.org/r/293475 (https://phabricator.wikimedia.org/T112423) [08:09:55] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:11:45] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.008 second response time [08:15:07] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 10Wikimedia-Logstash: Logstash elasticsearch mapping does not allow err.code to be a string - https://phabricator.wikimedia.org/T137400#2367183 (10Gehel) [08:15:50] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 10Wikimedia-Logstash: Logstash elasticsearch mapping does not allow err.code to be a string - https://phabricator.wikimedia.org/T137400#2367196 (10Gehel) [08:25:36] PROBLEM - puppet last run on ms-be2017 is CRITICAL: CRITICAL: puppet fail [08:25:58] (03PS2) 10Alexandros Kosiaris: ores: Add redis settings to worker nodes in labs [puppet] - 10https://gerrit.wikimedia.org/r/293429 (owner: 10Ladsgroup) [08:26:08] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] ores: Add redis settings to worker nodes in labs [puppet] - 10https://gerrit.wikimedia.org/r/293429 (owner: 10Ladsgroup) [08:28:14] (03CR) 10Alexandros Kosiaris: [C: 04-1] ores: Add graphite settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) (owner: 10Ladsgroup) [08:32:09] (03CR) 10Alexandros Kosiaris: [C: 031] ores: move config file to /etc/ores [puppet] - 10https://gerrit.wikimedia.org/r/292516 (owner: 10Ladsgroup) [08:38:16] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:38:24] (03CR) 10Ema: [C: 031] Force varnishkafka (compatible with Varnish 4) to output the Resp timestamp. 
[puppet] - 10https://gerrit.wikimedia.org/r/293327 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [08:38:26] !log rolling restart of app server canaries for libtasn security update [08:40:05] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.019 second response time [08:42:05] (03PS4) 10Elukey: Force Varnishkafka to output the Resp timestamp. [puppet] - 10https://gerrit.wikimedia.org/r/293327 (https://phabricator.wikimedia.org/T136314) [08:42:58] (03CR) 10Ema: [C: 031] Force Varnishkafka to output the Resp timestamp. [puppet] - 10https://gerrit.wikimedia.org/r/293327 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [08:43:35] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [08:43:48] (03CR) 10Elukey: [C: 032] "Puppet compiler looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/293327 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [08:44:47] akosiaris: o/ shall I merge your ores commit too? [08:44:54] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [08:47:30] seems only labs stuff, probably safe to merge [08:48:17] merging [08:48:44] from https://gerrit.wikimedia.org/r/#/c/293429/ it seems that we are backporting a config from prod to labs [08:49:26] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: Puppet has 1 failures [08:49:26] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [08:50:12] (03PS4) 10Gehel: Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 [08:50:24] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [08:50:25] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [08:50:35] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [08:50:45] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [08:51:05] (03PS2) 10Ladsgroup: ores: Add graphite settings [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) [08:52:24] RECOVERY - puppet last run on ms-be2017 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [08:54:59] !log installing libtasn security updates [08:56:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:56:35] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:05:17] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.013 second response time [09:05:30] !log restarting uwsgi-ores celery-ores-worker in scb1001 and scb1002 [09:08:57] PROBLEM - puppet last run on db2052 is CRITICAL: CRITICAL: Puppet has 1 failures [09:09:08] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.030 second response time [09:10:03] (03PS2) 10Gehel: Change expired file zoom level from 16 to 15. 
[puppet] - 10https://gerrit.wikimedia.org/r/291885 (https://phabricator.wikimedia.org/T136483) [09:12:07] (03CR) 10Gehel: [C: 032] Change expired file zoom level from 16 to 15. [puppet] - 10https://gerrit.wikimedia.org/r/291885 (https://phabricator.wikimedia.org/T136483) (owner: 10Gehel) [09:14:56] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [09:16:47] !log lowering disk high watermark to rebalance disk usage on elasticsearch eqiad cluster [09:21:07] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:06] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.018 second response time [09:32:35] (03PS4) 10Muehlenhoff: Provide the firejail containment for imagemagick's convert(1) on all app servers [puppet] - 10https://gerrit.wikimedia.org/r/293328 (https://phabricator.wikimedia.org/T135111) [09:36:23] (03PS1) 10Gehel: Configure proxy for HTTPS as well as HTTP in replicate-osm. [puppet] - 10https://gerrit.wikimedia.org/r/293477 [09:37:13] RECOVERY - puppet last run on db2052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:38:14] (03PS2) 10Gehel: Configure proxy for HTTPS as well as HTTP in replicate-osm. [puppet] - 10https://gerrit.wikimedia.org/r/293477 (https://phabricator.wikimedia.org/T134901) [09:46:52] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [09:46:53] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.051 second response time [09:47:36] (03CR) 10Muehlenhoff: [C: 032 V: 032] Provide the firejail containment for imagemagick's convert(1) on all app servers [puppet] - 10https://gerrit.wikimedia.org/r/293328 (https://phabricator.wikimedia.org/T135111) (owner: 10Muehlenhoff) [09:53:02] (03PS1) 10Muehlenhoff: Skip installing the imagemagick profile for now [puppet] - 10https://gerrit.wikimedia.org/r/293478 [09:53:38] (03CR) 10Muehlenhoff: [C: 032 V: 032] Skip installing the imagemagick profile for now [puppet] - 10https://gerrit.wikimedia.org/r/293478 (owner: 10Muehlenhoff) [09:54:52] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: Puppet has 1 failures [09:55:02] PROBLEM - puppet last run on mw2174 is CRITICAL: CRITICAL: Puppet has 1 failures [09:55:03] PROBLEM - puppet last run on mw1094 is CRITICAL: CRITICAL: Puppet has 1 failures [09:55:32] PROBLEM - puppet last run on mw1165 is CRITICAL: CRITICAL: Puppet has 1 failures [09:55:33] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet has 1 failures [09:55:42] PROBLEM - puppet last run on mw2107 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:12] PROBLEM - puppet last run on mw2106 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:13] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:13] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:14] PROBLEM - puppet last run on mw1233 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:14] PROBLEM - puppet last run on mw2094 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:32] PROBLEM - puppet last run on mw2186 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:34] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:42] PROBLEM - puppet last run on mw2133 is CRITICAL: CRITICAL: Puppet has 
1 failures [09:56:43] PROBLEM - puppet last run on mw2202 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:43] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:44] PROBLEM - puppet last run on mw1178 is CRITICAL: CRITICAL: Puppet has 1 failures [09:57:02] PROBLEM - puppet last run on mw1132 is CRITICAL: CRITICAL: Puppet has 1 failures [09:57:04] puppet failures should resolve soon [09:57:13] PROBLEM - puppet last run on mw2144 is CRITICAL: CRITICAL: Puppet has 1 failures [09:57:13] PROBLEM - puppet last run on mw1200 is CRITICAL: CRITICAL: Puppet has 1 failures [09:57:14] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: Puppet has 1 failures [09:57:23] PROBLEM - puppet last run on mw1198 is CRITICAL: CRITICAL: Puppet has 1 failures [09:57:23] PROBLEM - puppet last run on mw1256 is CRITICAL: CRITICAL: Puppet has 1 failures [09:57:33] PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: Puppet has 1 failures [09:57:53] PROBLEM - puppet last run on mw2209 is CRITICAL: CRITICAL: Puppet has 1 failures [09:58:04] PROBLEM - puppet last run on mw1141 is CRITICAL: CRITICAL: Puppet has 1 failures [09:58:04] PROBLEM - puppet last run on mw1116 is CRITICAL: CRITICAL: Puppet has 1 failures [09:58:04] PROBLEM - puppet last run on mw1145 is CRITICAL: CRITICAL: Puppet has 1 failures [09:58:12] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Puppet has 1 failures [09:58:13] PROBLEM - puppet last run on mw1001 is CRITICAL: CRITICAL: Puppet has 1 failures [09:58:14] PROBLEM - puppet last run on mw1174 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:04] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:13] PROBLEM - puppet last run on mw2157 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:13] PROBLEM - puppet last run on mw2118 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:14] PROBLEM - puppet last run on mw2091 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:22] PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:33] PROBLEM - puppet last run on mw2076 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:42] RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:59:43] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:44] PROBLEM - puppet last run on mw1022 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:52] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:53] PROBLEM - puppet last run on mw1185 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:53] PROBLEM - puppet last run on mw1139 is CRITICAL: CRITICAL: Puppet has 1 failures [10:00:03] PROBLEM - puppet last run on mw2141 is CRITICAL: CRITICAL: Puppet has 1 failures [10:00:22] PROBLEM - puppet last run on mw1093 is CRITICAL: CRITICAL: Puppet has 1 failures [10:00:23] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures [10:00:23] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures [10:00:23] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 1 failures [10:00:33] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Puppet has 1 failures [10:00:43] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Puppet has 1 failures [10:00:54] PROBLEM - puppet last run on mw2069 is CRITICAL: CRITICAL: Puppet has 1 failures [10:01:13] PROBLEM - puppet last run on mw1226 is 
CRITICAL: CRITICAL: Puppet has 1 failures [10:06:23] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:20:42] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [10:20:52] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [10:21:03] RECOVERY - puppet last run on mw1094 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:21:03] RECOVERY - puppet last run on mw2174 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:21:23] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [10:21:32] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:03] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:22:13] RECOVERY - puppet last run on mw2106 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [10:22:15] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:15] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [10:22:23] RECOVERY - puppet last run on mw2094 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:33] RECOVERY - puppet last run on mw2186 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [10:22:33] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 3 failures [10:22:43] RECOVERY - puppet last run on mw2133 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:43] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:52] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [10:22:52] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [10:23:03] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:13] RECOVERY - puppet last run on mw1200 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [10:23:22] RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [10:23:22] RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:32] RECOVERY - puppet last run on mw1256 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:33] RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:42] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:52] RECOVERY - puppet last run on mw1022 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [10:24:02] RECOVERY - puppet last run on mw2209 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [10:24:12] RECOVERY - puppet last run on mw1145 is OK: OK: Puppet is currently enabled, last run 34 seconds 
ago with 0 failures [10:24:12] RECOVERY - puppet last run on mw1141 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [10:24:13] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:24:14] RECOVERY - puppet last run on mw1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:24:22] RECOVERY - puppet last run on mw1093 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [10:24:53] RECOVERY - puppet last run on mw2069 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [10:25:04] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [10:25:14] RECOVERY - puppet last run on mw2157 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [10:25:14] RECOVERY - puppet last run on mw2118 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [10:25:22] RECOVERY - puppet last run on mw2091 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:25:23] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [10:25:33] RECOVERY - puppet last run on mw2076 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [10:25:43] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:25:44] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:25:52] RECOVERY - puppet last run on mw1185 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:25:53] RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:26:04] RECOVERY - puppet last run on mw2141 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:26:13] RECOVERY - puppet last run on mw1174 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:26:22] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [10:26:22] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [10:26:23] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [10:26:32] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:26:33] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:27:03] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:27:03] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:29:03] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [10:35:47] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:37:07] (03CR) 10Alexandros Kosiaris: [C: 031] Configure proxy for HTTPS as well as HTTP in replicate-osm. 
[puppet] - 10https://gerrit.wikimedia.org/r/293477 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [10:37:37] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:39:27] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 7.616 second response time [10:39:47] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [10:46:57] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 674 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5572966 keys - replication_delay is 674 [10:48:36] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [10:48:37] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [10:49:28] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:50:56] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5526228 keys - replication_delay is 0 [10:51:14] !log Restarting Cassandra on {cerium,praseodymium}.eqiad.wmnet (RESTBase staging) : T126629 [10:51:15] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [10:51:26] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 8.757 second response time [10:52:13] (03CR) 10Muehlenhoff: [C: 04-1] contint: cleanup gallium / use contint1001 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/293283 (https://phabricator.wikimedia.org/T137358) (owner: 10Hashar) [10:53:55] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: Grant graphite.wikimedia.org rights to grafana-admin LDAP group - https://phabricator.wikimedia.org/T137373#2366461 (10JanZerebecki) Based on your existing production access, you should be given the nda group. [10:54:16] (03CR) 10Alexandros Kosiaris: [C: 04-1] explicitely set input reader format in osm2pgsql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293475 (https://phabricator.wikimedia.org/T112423) (owner: 10Gehel) [10:54:52] (03CR) 10JanZerebecki: "I would suggest to go with the only winning option." [puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) (owner: 10Elukey) [10:55:37] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:56:27] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:56:27] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:57:48] (03CR) 10Alexandros Kosiaris: [C: 032] ores: Add graphite settings [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) (owner: 10Ladsgroup) [10:57:55] (03PS3) 10Alexandros Kosiaris: ores: Add graphite settings [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) (owner: 10Ladsgroup) [10:58:01] (03CR) 10Alexandros Kosiaris: [V: 032] ores: Add graphite settings [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) (owner: 10Ladsgroup) [10:59:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Right before merging I noticed that the default should also be changed." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) (owner: 10Ladsgroup) [11:02:54] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [11:03:53] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:05:03] PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:05:33] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.026 second response time [11:05:33] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:06:22] urandom: ^ [11:06:24] known? [11:08:13] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:08:35] urandom went for lunch, sigh [11:08:43] 06Operations: decom furud - https://phabricator.wikimedia.org/T137221#2367426 (10akosiaris) >>! In T137221#2361776, @Dzahn wrote: > @akosiaris have you seen the problem above before when deleting VMs? ^ No, but then again we haven't been deleting VMs much yet. I am not sure what that is, but it might be a netw... [11:08:46] that's staging, though, so no worries [11:09:48] (03CR) 10Alexandros Kosiaris: [C: 031] Maps tables should all be owned by osmimporter [puppet] - 10https://gerrit.wikimedia.org/r/293331 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [11:12:14] (03Abandoned) 10KartikMistry: lttoolbox: New upstream version [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/269115 (https://phabricator.wikimedia.org/T124137) (owner: 10KartikMistry) [11:12:53] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [11:13:14] (03PS1) 10KartikMistry: lttoolbox: New upstream version [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/293484 (https://phabricator.wikimedia.org/T124137) [11:13:23] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [11:13:53] akosiaris: I'm rebuilding all Apertium packages with +wmf version scheme, so it is easier to track when importing from Debian. [11:14:15] akosiaris: I'll ping once base packages are ready. [11:15:08] kart_: ok [11:16:02] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [11:19:14] PROBLEM - cassandra-a service on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
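On the "+wmf version scheme" mentioned above for the apertium rebuilds: suffixing the upstream Debian version with +wmfN makes the local build sort after the exact version it was imported from, but still before the next upstream release, which is what keeps imports from Debian easy to track. The version numbers below are illustrative:

    # Debian version ordering for a +wmf rebuild: it supersedes the imported
    # upstream version, and a newer upstream version supersedes it in turn.
    dpkg --compare-versions 3.4.0-1 lt 3.4.0-1+wmf1 && echo "wmf rebuild is newer"
    dpkg --compare-versions 3.4.0-1+wmf1 lt 3.4.2-1 && echo "next upstream still wins"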
[11:19:24] (03PS4) 10Ladsgroup: ores: Add graphite settings [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) [11:21:13] RECOVERY - cassandra-a service on xenon is OK: OK - cassandra-a is active [11:22:12] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [11:23:55] (03PS1) 10KartikMistry: cg3: New upstream release [debs/contenttranslation/cg3] - 10https://gerrit.wikimedia.org/r/293485 (https://phabricator.wikimedia.org/T107306) [11:25:48] (03CR) 10Alexandros Kosiaris: [C: 031] Define backup for contint [puppet] - 10https://gerrit.wikimedia.org/r/293294 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [11:26:17] (03CR) 10Alexandros Kosiaris: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/293294 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [11:27:23] (03CR) 10jenkins-bot: [V: 04-1] Define backup for contint [puppet] - 10https://gerrit.wikimedia.org/r/293294 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [11:27:34] (03CR) 10Alexandros Kosiaris: [C: 032] ores: Add graphite settings [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) (owner: 10Ladsgroup) [11:27:39] (03PS5) 10Alexandros Kosiaris: ores: Add graphite settings [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) (owner: 10Ladsgroup) [11:27:44] (03CR) 10Alexandros Kosiaris: [V: 032] ores: Add graphite settings [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) (owner: 10Ladsgroup) [11:27:53] (03PS2) 10Gehel: Maps tables should all be owned by osmimporter [puppet] - 10https://gerrit.wikimedia.org/r/293331 (https://phabricator.wikimedia.org/T134901) [11:28:04] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:29:50] (03CR) 10Alexandros Kosiaris: [C: 04-1] Define backup for contint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293294 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [11:30:03] (03PS3) 10Gehel: Maps tables should all be owned by osmimporter [puppet] - 10https://gerrit.wikimedia.org/r/293331 (https://phabricator.wikimedia.org/T134901) [11:30:39] (03PS2) 10Alexandros Kosiaris: Fix whitespace issue [puppet] - 10https://gerrit.wikimedia.org/r/293291 (owner: 10Dereckson) [11:31:18] (03CR) 10Gehel: [C: 032] Maps tables should all be owned by osmimporter [puppet] - 10https://gerrit.wikimedia.org/r/293331 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [11:33:27] !log manually restarting ores-uwsgi and celery-ores-worker in scb100[12] [11:37:30] (03PS3) 10Gehel: WIP - Upgrade osm2pgsql to 0.90.0 [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) [11:37:51] (03CR) 10Alexandros Kosiaris: "@Eevans, the cassandra role is not re-used anywhere but the restbase clusters. And that's a good thing. 
In fact, given that the cassandra " [puppet] - 10https://gerrit.wikimedia.org/r/290860 (owner: 10Eevans) [11:38:18] (03PS2) 10Gehel: explicitely set input reader format in osm2pgsql [puppet] - 10https://gerrit.wikimedia.org/r/293475 (https://phabricator.wikimedia.org/T112423) [11:38:34] (03PS3) 10Alexandros Kosiaris: Fix whitespace issue [puppet] - 10https://gerrit.wikimedia.org/r/293291 (owner: 10Dereckson) [11:38:40] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Fix whitespace issue [puppet] - 10https://gerrit.wikimedia.org/r/293291 (owner: 10Dereckson) [11:41:23] PROBLEM - cassandra-a service on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:43:23] RECOVERY - cassandra-a service on xenon is OK: OK - cassandra-a is active [11:44:56] (03PS2) 10Alexandros Kosiaris: Update openldap module's README [puppet] - 10https://gerrit.wikimedia.org/r/292585 [11:45:02] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Update openldap module's README [puppet] - 10https://gerrit.wikimedia.org/r/292585 (owner: 10Alexandros Kosiaris) [11:48:13] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [11:48:24] 06Operations, 10ops-eqiad: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#2367466 (10Cmjohnson) [11:51:27] mobrovac: yeah, sorry, it might do that... [11:51:41] mobrovac: i'm pushing it hard, trying to break it [11:52:14] no pb, urandom, better for it to be a controlled failure than a wtf :) [11:52:51] mobrovac: well, and it helps that it's not a production syste [11:52:52] m [11:53:03] that too :P [11:53:06] though it would be nice if it weren't monitored as such [11:53:42] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:56:38] (03PS2) 10Alexandros Kosiaris: networks::constants: use slice_network_constants [puppet] - 10https://gerrit.wikimedia.org/r/291819 [11:56:40] (03PS28) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [11:57:21] (03CR) 10jenkins-bot: [V: 04-1] networks::constants: use slice_network_constants [puppet] - 10https://gerrit.wikimedia.org/r/291819 (owner: 10Alexandros Kosiaris) [11:57:27] (03CR) 10jenkins-bot: [V: 04-1] network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [11:57:32] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.016 second response time [11:59:18] (03CR) 10Alexandros Kosiaris: "@Faidon, mw_appserver_networks and analytics_networks are being done in the followup patch. Cleaning up the ERB defs is something I 'd lik" [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [12:00:23] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:03:14] (03CR) 10JanZerebecki: "In case Alexandros comment wasn't to be understood as this is a social issue. 
I think the preferred thing in operations.git these days is " [puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) (owner: 10Elukey) [12:04:57] (03Abandoned) 10Gergő Tisza: Clean up AuthManager configuration (no-op) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293440 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [12:06:45] !log Temporarily disabling puppet on xenon.eqiad.wmnet to test settings : T126629 [12:06:46] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [12:10:45] 06Operations, 10Ops-Access-Requests, 06Services: sc-admins should be able to join firejail containers - https://phabricator.wikimedia.org/T137412#2367544 (10mobrovac) [12:11:47] !log Restarting Cassandra on xenon.eqiad.wmnet : T126629 [12:11:48] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [12:12:13] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [12:12:27] 06Operations, 06Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2367569 (10Gehel) It seems that at the moment only .js and .css files which ar in the js / css directories are processed by filerev. Is there any reason to not... [12:14:24] 06Operations, 10Ops-Access-Requests, 06Services: sc-admins should be able to join firejail containers - https://phabricator.wikimedia.org/T137412#2367571 (10MoritzMuehlenhoff) For background: firejail starts the various service in a separate namespace. "firejail --join" allows a user to join that namespace a... [12:14:33] (03PS1) 10Gergő Tisza: Enable AuthManager on group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293491 (https://phabricator.wikimedia.org/T135504) [12:15:43] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:16:36] (03PS1) 10Gehel: Don't publish etags for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/293492 (https://phabricator.wikimedia.org/T137238) [12:17:43] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [12:19:21] !log deploying [[gerrit:293459]] to fix morebots (T137377) [12:19:22] T137377: all morebots stopped listening to !log lines - https://phabricator.wikimedia.org/T137377 [12:19:43] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:20:19] thcipriani|afk: ^ (just in case you are still working on tin) [12:21:17] (03CR) 10Luke081515: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/293464 (https://phabricator.wikimedia.org/T137252) (owner: 10Microchip08) [12:24:24] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:24:51] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2367600 (10KartikMistry) [12:26:23] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [12:27:34] (03PS1) 10KartikMistry: hfst: New upstream release [debs/contenttranslation/hfst] - 10https://gerrit.wikimedia.org/r/293494 (https://phabricator.wikimedia.org/T95653) [12:29:43] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
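For background on the T137412 request above: a service started under firejail runs inside its own namespaces, and "firejail --join" is what lets an operator enter that sandbox to inspect the running process. Roughly, assuming the service was launched with an explicit sandbox name (the service path and name are illustrative):

    # Launch a service inside a named firejail sandbox.
    firejail --name=apertium-apy /usr/bin/apertium-apy &
    # List running sandboxes, then join one to get a shell inside its namespaces;
    # joining a sandbox owned by another user is what needs the extra privileges
    # being requested for sc-admins.
    firejail --list
    sudo firejail --join=apertium-apy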
[12:33:15] !log tgr@tin Synchronized php-1.28.0-wmf.5/extensions/LdapAuthentication/LdapPrimaryAuthenticationProvider.php: deploy [[gerrit:293459]] to fix wikitech API login / morebots (T137377) (duration: 00m 47s) [12:33:16] T137377: all morebots stopped listening to !log lines - https://phabricator.wikimedia.org/T137377 [12:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:33:51] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:45] (03PS1) 10KartikMistry: apertium: New upstream release [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/293497 (https://phabricator.wikimedia.org/T107306) [12:37:30] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:45:31] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [12:47:12] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:49:40] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [12:50:59] !log Restarting Cassandra on xenon.eqiad.wmnet : T126629 [12:51:00] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [12:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:51:11] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:19] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2367721 (10Jonas) We can only use filerev for files that are referenced from the html file, because the application itself is not aware about filerev and then will not find... [13:03:23] (03CR) 10Hashar: "Few replies to Moritz." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/293283 (https://phabricator.wikimedia.org/T137358) (owner: 10Hashar) [13:06:11] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium): Remove zuul-merger from gallium - https://phabricator.wikimedia.org/T137418#2367728 (10hashar) [13:08:00] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:11:20] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:11:41] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.029 second response time [13:12:49] (03PS1) 10Hashar: contint: remove zuul-merger from gallium [puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) [13:12:51] RECOVERY - Host mw1063 is UP: PING OK - Packet loss = 0%, RTA = 2.21 ms [13:13:31] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:15:20] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
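(For context, the xenon test loop in the !log entries above -- disable puppet so hand-edited settings survive, restart Cassandra, re-enable later -- is roughly the following; the cassandra-a unit name is inferred from the Icinga checks and may differ:)

    # keep puppet from reverting local test settings; the reason is recorded with the lock
    sudo puppet agent --disable 'testing cassandra settings, T126629'
    # pick up the edited settings
    sudo systemctl restart cassandra-a
    # once done, re-enable puppet and force a run to restore the puppetised config
    sudo puppet agent --enable && sudo puppet agent --test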
[13:15:30] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [13:15:31] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [13:16:33] (03PS5) 10Gehel: Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 [13:18:50] (03CR) 10Anomie: [C: 031] Enable AuthManager on group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293491 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [13:20:21] (03CR) 10JanZerebecki: [C: 031] "After looking at the change again. Forget what I said, correct solution." [puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) (owner: 10Elukey) [13:20:21] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [13:21:11] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:30] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:22:18] (03CR) 10Alexandros Kosiaris: [C: 031] "@Elukey, on 1) since you 've justified adequately removing these redundant checks and are OK with you and andrew receiving the alarms and " [puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) (owner: 10Elukey) [13:25:21] (03PS4) 10Gehel: Upgrade osm2pgsql to 0.90.0 [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) [13:26:39] !log Restarting Cassandra on xenon.eqiad.wmnet to apply 2G file cache : T137419 [13:26:40] T137419: Investigate aberrant disk read throughput in Cassandra 2.2.6 - https://phabricator.wikimedia.org/T137419 [13:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:27:12] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:27:20] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [13:28:28] (03PS2) 10Hashar: contint: remove zuul-merger from gallium [puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) [13:33:51] (03PS3) 10Hashar: contint: remove zuul-merger from gallium [puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) [13:34:16] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:35:37] (03CR) 10Hashar: "I have ran patchset 1 through the puppet compiler and noticed the ferm rule that allow Gearman on gallium would be dropped which would cau" [puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) (owner: 10Hashar) [13:36:06] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [13:38:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] "two inline comments, otherwise looks good to me" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) (owner: 10Gehel) [13:39:46] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:41:53] (03PS1) 10Anomie: Add tags for group1 and group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293502 [13:41:58] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:42:22] (03PS2) 10Elukey: Remove old and redundant AQS specific alarms. 
[puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) [13:42:44] (03CR) 10Hashar: "I will restore /etc/default/zuul-merger to the default from the deb package and then disable the service on boot with:" [puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) (owner: 10Hashar) [13:44:45] (03CR) 10Elukey: [C: 032] "Thanks for all the comments! I am going to merge this code review and then follow up in https://phabricator.wikimedia.org/T137422" [puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) (owner: 10Elukey) [13:45:21] !log change-prop deploying 2161403c [13:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:46:38] (03CR) 10Hashar: [C: 04-1] contint: limit access to zuul-merger git daemon (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/293449 (https://phabricator.wikimedia.org/T137323) (owner: 10Dzahn) [13:50:16] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 5 failures [13:51:23] (03PS4) 10Hashar: contint: remove zuul-merger from gallium [puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) [13:52:30] !log change-prop restarting on scb1002 for update [13:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:53:56] (03PS5) 10Gehel: Upgrade osm2pgsql to 0.90.0 [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) [13:54:17] RECOVERY - jenkins_zmq_publisher on contint1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 8888 [13:54:27] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:47] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 711 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5545535 keys - replication_delay is 711 [13:56:31] (03CR) 10Gehel: [C: 04-1] "tests need to be updated" [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) (owner: 10Gehel) [13:56:47] 06Operations, 06Labs, 10Labs-Infrastructure, 06Release-Engineering-Team, and 2 others: Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2367839 (10hashar) @Dzahn thanks, though all those rules are indeed present on ho... [13:57:07] PROBLEM - cassandra-a service on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:57:35] Icinga check jenkins_zmq_publisher on contint1001 can be ignored. Jenkins is not running there [13:57:37] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:57:46] been doing a nc -l on the host for tests purposes [13:58:17] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 1.172 second response time [13:59:07] RECOVERY - cassandra-a service on xenon is OK: OK - cassandra-a is active [13:59:21] 06Operations, 06Labs, 10Labs-Infrastructure, 06Release-Engineering-Team, and 3 others: Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2367841 (10Dzahn) [14:00:16] (03CR) 10Hashar: "I have dropped the explicit ferm rule for Zuul server -- gearman since ferm allow all traffic on 127.0.0.1." 
[puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) (owner: 10Hashar) [14:00:26] PROBLEM - jenkins_zmq_publisher on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused [14:00:40] ^^^ jenkins_zmq_publisher can be ACK / ignored [14:02:57] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5530360 keys - replication_delay is 0 [14:04:17] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [14:04:45] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:05:25] PROBLEM - Apache HTTP on mw1263 is CRITICAL: Connection timed out [14:05:27] PROBLEM - Check size of conntrack table on mw1266 is CRITICAL: Timeout while attempting connection [14:05:27] PROBLEM - dhclient process on mw1263 is CRITICAL: Timeout while attempting connection [14:05:45] PROBLEM - mediawiki-installation DSH group on mw1263 is CRITICAL: Host mw1263 is not in mediawiki-installation dsh group [14:05:55] PROBLEM - DPKG on mw1266 is CRITICAL: Timeout while attempting connection [14:06:15] PROBLEM - Disk space on mw1266 is CRITICAL: Timeout while attempting connection [14:06:15] PROBLEM - nutcracker port on mw1263 is CRITICAL: Timeout while attempting connection [14:06:35] PROBLEM - nutcracker process on mw1263 is CRITICAL: Timeout while attempting connection [14:06:35] PROBLEM - MD RAID on mw1266 is CRITICAL: Timeout while attempting connection [14:06:45] PROBLEM - puppet last run on mw1263 is CRITICAL: Timeout while attempting connection [14:06:52] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: analytics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273#2367846 (10Ottomata) [14:06:55] PROBLEM - salt-minion processes on mw1263 is CRITICAL: Timeout while attempting connection [14:07:05] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: Smartctl disk defects on kafka1012 - https://phabricator.wikimedia.org/T136933#2367848 (10Ottomata) p:05Triage>03Normal [14:07:15] PROBLEM - Apache HTTP on mw1266 is CRITICAL: Connection timed out [14:07:17] PROBLEM - configured eth on mw1266 is CRITICAL: Timeout while attempting connection [14:07:35] PROBLEM - dhclient process on mw1266 is CRITICAL: Timeout while attempting connection [14:07:35] PROBLEM - Check size of conntrack table on mw1263 is CRITICAL: Timeout while attempting connection [14:07:39] !log Re-enabling puppet on xenon.eqiad.wmnet, forcing a run, and restarting Cassandra : T137419 [14:07:40] T137419: Investigate aberrant disk read throughput in Cassandra 2.2.6 - https://phabricator.wikimedia.org/T137419 [14:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:07:45] PROBLEM - mediawiki-installation DSH group on mw1266 is CRITICAL: Host mw1266 is not in mediawiki-installation dsh group [14:07:47] PROBLEM - DPKG on mw1263 is CRITICAL: Timeout while attempting connection [14:08:06] PROBLEM - nutcracker port on mw1266 is CRITICAL: Timeout while attempting connection [14:08:06] PROBLEM - Disk space on mw1263 is CRITICAL: Timeout while attempting connection [14:08:25] PROBLEM - MD RAID on mw1263 is CRITICAL: Timeout while attempting connection [14:08:25] PROBLEM - nutcracker process on mw1266 is CRITICAL: Timeout while attempting connection [14:08:26] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 2 hours ago with 0 failures [14:08:45] PROBLEM - 
puppet last run on mw1266 is CRITICAL: Timeout while attempting connection [14:08:56] PROBLEM - salt-minion processes on mw1266 is CRITICAL: Timeout while attempting connection [14:09:15] PROBLEM - configured eth on mw1263 is CRITICAL: Timeout while attempting connection [14:12:13] !log change-prop restarting on scb1001 for update [14:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:16:26] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:46] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [14:17:31] 06Operations, 10Ops-Access-Requests, 06Services: sc-admins should be able to join firejail containers - https://phabricator.wikimedia.org/T137412#2367865 (10JanZerebecki) 05Open>03Resolved a:03JanZerebecki Done. sc-admins share the namespaces of all services. (Except the tmpfs which the task indicates... [14:18:37] PROBLEM - NTP on mw1063 is CRITICAL: NTP CRITICAL: No response from NTP server [14:18:45] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [14:19:02] (03PS1) 10KartikMistry: apertium-lex-tools: New upstream release [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/293507 (https://phabricator.wikimedia.org/T107306) [14:22:44] (03CR) 10KartikMistry: "Can this be issue like, https://unix.stackexchange.com/questions/167533/what-does-gbperror-upstream-1-5-13-is-not-a-valid-treeish-mean ?" [debs/contenttranslation/giella-core] - 10https://gerrit.wikimedia.org/r/270671 (https://phabricator.wikimedia.org/T120087) (owner: 10KartikMistry) [14:26:46] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:28:45] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.019 second response time [14:29:09] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me now." [puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) (owner: 10Hashar) [14:33:00] !log stopped / disabled zuul-merger on gallium T137418 [14:33:01] T137418: Remove zuul-merger from gallium - https://phabricator.wikimedia.org/T137418 [14:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:33:33] (03PS5) 10Muehlenhoff: contint: remove zuul-merger from gallium [puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) (owner: 10Hashar) [14:34:45] RECOVERY - Apache HTTP on mw1266 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.006 second response time [14:35:07] (03CR) 10Muehlenhoff: [C: 032 V: 032] contint: remove zuul-merger from gallium [puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) (owner: 10Hashar) [14:35:25] (03CR) 10Alexandros Kosiaris: [C: 031] Add the analytics contact group to hadoop/kafka related nrpe monitors. 
[puppet] - 10https://gerrit.wikimedia.org/r/293114 (https://phabricator.wikimedia.org/T125128) (owner: 10Elukey) [14:36:35] RECOVERY - Apache HTTP on mw1263 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.007 second response time [14:37:57] !log Removing zuul-merger from gallium [14:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:38:35] !log Tested temp setting retention.bytes=2G for Analytics kafka topic webrequest_misc [14:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:45] PROBLEM - puppet last run on mw1242 is CRITICAL: CRITICAL: Puppet has 1 failures [14:39:45] RECOVERY - Disk space on mw1266 is OK: DISK OK [14:39:54] RECOVERY - salt-minion processes on mw1266 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:40:04] RECOVERY - configured eth on mw1266 is OK: OK - interfaces up [14:40:05] RECOVERY - dhclient process on mw1266 is OK: PROCS OK: 0 processes with command name dhclient [14:40:35] RECOVERY - nutcracker port on mw1266 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [14:40:35] RECOVERY - salt-minion processes on mw1263 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:40:38] (03PS4) 10Ottomata: Add ssh:userkey for eventlogging scap::targets [puppet] - 10https://gerrit.wikimedia.org/r/293217 (https://phabricator.wikimedia.org/T137192) (owner: 1020after4) [14:40:44] RECOVERY - Check size of conntrack table on mw1263 is OK: OK: nf_conntrack is 0 % full [14:40:44] RECOVERY - MD RAID on mw1263 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [14:40:44] RECOVERY - nutcracker process on mw1266 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [14:40:45] RECOVERY - Disk space on mw1063 is OK: DISK OK [14:40:52] 06Operations, 10Ops-Access-Requests, 06Services: sc-admins should be able to join firejail containers - https://phabricator.wikimedia.org/T137412#2367957 (10JanZerebecki) 05Resolved>03Open Sorry I forgot the main part of the request, the other namespaces besides file system. Will upload a patch. [14:40:54] RECOVERY - MD RAID on mw1266 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [14:40:55] RECOVERY - nutcracker process on mw1263 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [14:40:56] RECOVERY - salt-minion processes on mw1063 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:40:56] RECOVERY - Check size of conntrack table on mw1266 is OK: OK: nf_conntrack is 0 % full [14:40:56] RECOVERY - DPKG on mw1263 is OK: All packages OK [14:41:05] RECOVERY - configured eth on mw1063 is OK: OK - interfaces up [14:41:08] those are mine, new app servers! 
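(The retention.bytes test logged above is a per-topic override layered on top of the broker-wide default. A sketch of how such an override is usually set with the stock Kafka tooling; the ZooKeeper address is a placeholder, and on a Kafka 0.9 cluster the older kafka-topics.sh --alter --config form works as well:)

    # add a size-based retention override for one topic (2G = 2147483648 bytes)
    kafka-configs.sh --zookeeper zookeeper.example:2181/kafka --alter \
        --entity-type topics --entity-name webrequest_misc \
        --add-config retention.bytes=2147483648
    # confirm the override took effect
    kafka-configs.sh --zookeeper zookeeper.example:2181/kafka --describe \
        --entity-type topics --entity-name webrequest_misc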
[14:41:24] PROBLEM - NTP on mw1263 is CRITICAL: NTP CRITICAL: Offset unknown [14:41:25] RECOVERY - dhclient process on mw1063 is OK: PROCS OK: 0 processes with command name dhclient [14:41:25] RECOVERY - Disk space on mw1263 is OK: DISK OK [14:41:28] not sure why mw1063 still pops up [14:41:45] RECOVERY - configured eth on mw1263 is OK: OK - interfaces up [14:42:04] RECOVERY - dhclient process on mw1263 is OK: PROCS OK: 0 processes with command name dhclient [14:42:05] RECOVERY - nutcracker port on mw1263 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [14:44:45] RECOVERY - DPKG on mw1063 is OK: All packages OK [14:44:49] (03PS6) 10Gehel: Upgrade osm2pgsql to 0.90.0 [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) [14:45:24] RECOVERY - DPKG on mw1266 is OK: All packages OK [14:45:35] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:46:29] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2367970 (10hashar) [14:46:31] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Remove zuul-merger from gallium - https://phabricator.wikimedia.org/T137418#2367967 (10hashar) 05Open>03Resolved Puppet ran just fine, and the iptables rules looks ok. I have manually cleaned up the host:... [14:47:34] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 9.108 second response time [14:47:46] !log change-prop stopped on scb1002 [14:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:49:16] (03CR) 10Ottomata: [C: 032] Add ssh:userkey for eventlogging scap::targets [puppet] - 10https://gerrit.wikimedia.org/r/293217 (https://phabricator.wikimedia.org/T137192) (owner: 1020after4) [14:50:08] 06Operations, 10Ops-Access-Requests, 06Services: sc-admins should be able to join firejail containers - https://phabricator.wikimedia.org/T137412#2367977 (10mobrovac) @JanZerebecki I think there must be a misunderstanding here. This is an access request for `sc[ab][12]00[12]` hosts in production. [14:52:05] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [14:52:52] (03CR) 10Alexandros Kosiaris: [C: 031] Upgrade osm2pgsql to 0.90.0 [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) (owner: 10Gehel) [14:53:01] (03CR) 10Gergő Tisza: [C: 031] Add tags for group1 and group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293502 (owner: 10Anomie) [14:55:05] RECOVERY - NTP on mw1063 is OK: NTP OK: Offset 0.002803564072 secs [14:55:14] RECOVERY - NTP on mw1263 is OK: NTP OK: Offset 0.0001165866852 secs [14:55:26] thcipriani, gehel, want to use morning SWAT for another scap3 try? 
[14:55:40] (03PS1) 10JanZerebecki: Allow firefail --join=* for sc-admin [puppet] - 10https://gerrit.wikimedia.org/r/293510 (https://phabricator.wikimedia.org/T137412) [14:55:50] yurik: still trying to figure out what could have happened with tilerator :\ [14:56:06] * gehel is available [14:56:45] (03CR) 10jenkins-bot: [V: 04-1] Allow firefail --join=* for sc-admin [puppet] - 10https://gerrit.wikimedia.org/r/293510 (https://phabricator.wikimedia.org/T137412) (owner: 10JanZerebecki) [14:56:52] thcipriani: just let me know if there is something I can do to help, I had a quick look and don't see what was wrong... [14:59:15] PROBLEM - puppet last run on mw1266 is CRITICAL: CRITICAL: Puppet has 2 failures [14:59:15] PROBLEM - puppet last run on mw1263 is CRITICAL: CRITICAL: Puppet has 2 failures [14:59:18] thcipriani, poke akosiaris, i'm sure its all his fault :-P [14:59:35] PROBLEM - puppet last run on mw1063 is CRITICAL: CRITICAL: Puppet has 2 failures [14:59:58] akosiaris, context - for some reason, tilerator switch to scap3 fails to chown its dir [15:00:04] anomie, ostriches, thcipriani, marktraceur, and aude: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160609T1500). [15:00:29] gehel: yeah, my suspicion is it has something to do with the fact that there are multiple services that both use scap::target on the same box, but so far I can't see how that is causing a problem. Also it's true that we have multiple scap::targets on other boxes seemingly without issue. [15:00:36] 06Operations, 10Phabricator, 06Project-Admins, 06Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#2368015 (10Sebastian_Berlin-WMSE) >>! In T706#2364293, @Aklapper wrote: > Now that [[ https://www.mediawiki.org/wiki/Phabricator/Project_m... [15:01:46] nothing for swat [15:02:04] I can add a patch in a few minutes. [15:02:09] a tilerator has moved to scap3 ? great [15:03:03] akosiaris: tilerator is *trying* to move to scap3... but failing at this point [15:03:06] (03CR) 10Alexandros Kosiaris: "It's git tags missing from the repo." [debs/contenttranslation/giella-core] - 10https://gerrit.wikimedia.org/r/270671 (https://phabricator.wikimedia.org/T120087) (owner: 10KartikMistry) [15:03:08] well, kartotherian moved, tilerator had some ownership issues that scap::target should take care of on targets that I'm trying to figure out. [15:04:00] PROBLEM - Apache HTTP on mw1263 is CRITICAL: Connection refused [15:04:28] this is still a new appserver [15:04:30] PROBLEM - Apache HTTP on mw1266 is CRITICAL: Connection refused [15:06:20] RECOVERY - puppet last run on mw1242 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:06:57] (03PS1) 10Dereckson: Add *.nara.gov to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293513 (https://phabricator.wikimedia.org/T137423) [15:07:10] (03CR) 10Gehel: [C: 04-1] "This should be merged at the same time as https://gerrit.wikimedia.org/r/#/c/293475/" [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) (owner: 10Gehel) [15:07:16] aude: ^ here you are if you wish to SWAT something, I can add this patch to the Deployments table for this . 
[15:07:19] morning SWAT [15:07:33] 06Operations, 10Traffic, 13Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2368049 (10Aklapper) All patches seem to be merged. What are the next steps in this task? [15:09:11] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [15:09:31] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [15:11:15] Dereckson: looking [15:11:31] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5533846 keys - replication_delay is 0 [15:12:39] (03CR) 10Aude: Add *.nara.gov to wgCopyUploadsDomains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293513 (https://phabricator.wikimedia.org/T137423) (owner: 10Dereckson) [15:13:04] (03PS7) 10Yurik: Upgrade osm2pgsql to 0.90.0 [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) (owner: 10Gehel) [15:13:16] (03PS1) 10JanZerebecki: services firejail: make fs blacklist more obvious [puppet] - 10https://gerrit.wikimedia.org/r/293515 [15:13:42] (03CR) 10Dereckson: "Fixing that." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293513 (https://phabricator.wikimedia.org/T137423) (owner: 10Dereckson) [15:13:56] (03CR) 10jenkins-bot: [V: 04-1] Upgrade osm2pgsql to 0.90.0 [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) (owner: 10Gehel) [15:14:09] Dereckson: can deploy the patch [15:14:23] (03PS2) 10Dereckson: Add *.nara.gov to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293513 (https://phabricator.wikimedia.org/T137423) [15:14:28] ^ with spaces to align [15:14:34] 06Operations, 10Traffic, 13Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2368064 (10BBlack) There's still long-term investigation ongoing on the effects of jemalloc tuning and the effect on frontend hitrates (and the latter in conjuction with co... [15:14:46] thanks [15:15:13] Does phabricator.wikimedia.org use nginx with apache or apache only. [15:15:17] Just wondering. [15:15:26] please [15:15:29] (03PS2) 10JanZerebecki: Allow firefail --join=* for sc-admin [puppet] - 10https://gerrit.wikimedia.org/r/293510 (https://phabricator.wikimedia.org/T137412) [15:15:54] (03CR) 10Aude: [C: 032] Add *.nara.gov to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293513 (https://phabricator.wikimedia.org/T137423) (owner: 10Dereckson) [15:16:41] paladox: modules/role/manifests/phabricator/main.pp [15:16:53] Dereckson ok thanks [15:16:58] no trace of nginx [15:17:27] (03Merged) 10jenkins-bot: Add *.nara.gov to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293513 (https://phabricator.wikimedia.org/T137423) (owner: 10Dereckson) [15:18:15] paladox: do you need help to set up Phabricator somewhere with nginx? [15:18:28] Dereckson nope, im testing locally. [15:18:42] And found that the websites using nginx are faster then apache. [15:19:15] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Add *.nara.gov to wgCopyUploadDomains (duration: 00m 40s) [15:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:22] done [15:19:32] (03PS1) 10Thcipriani: tilerator to scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/293518 [15:19:34] Testing. 
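(For the tilerator ownership problem being debugged above, the first thing worth checking on an affected target is whether the deploy tree ended up owned by the wrong service user once two scap::target resources share a host. Paths and user names below are illustrative guesses, not taken from the actual puppet config:)

    # compare ownership of the co-located deploy trees
    ls -ld /srv/deployment/tilerator/deploy /srv/deployment/kartotherian/deploy
    # if tilerator's tree was claimed by the other service's user, hand it back
    sudo chown -R deploy-service:deploy-service /srv/deployment/tilerator/deploy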
[15:19:37] paladox: https://github.com/nasqueron/docker-phabricator/blob/master/files/etc/nginx/sites-available/default [15:19:40] thanks [15:19:56] Dereckson thanks [15:22:07] Error fetching URL: SSL certificate problem: unable to get local issuer certificate [15:22:15] !log Cleaning git-daemon on gallium (was used by zuul-merger) T137418 [15:22:16] T137418: Remove zuul-merger from gallium - https://phabricator.wikimedia.org/T137418 [15:22:16] :( [15:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:21] aude: works in HTTP [15:23:26] hmmm [15:24:10] mutante: an US archives site have some certificates issues too, not publishing the full certificates chain. That's not reserved to our bloggers. [15:24:30] aude: that's what happen when only the final certificate is published, not the intermediate between the root CA and this one [15:24:45] browsers download intermediate certificates, curl doesn't [15:25:07] if there is something NARA needs to fix, suppose we can poke dominic and maybe he can poke the right people [15:27:57] (03PS3) 10Gehel: explicitely set input reader format in osm2pgsql [puppet] - 10https://gerrit.wikimedia.org/r/293475 (https://phabricator.wikimedia.org/T112423) [15:30:18] 06Operations: Some labvirt systems use qemu from "cloud archive" - https://phabricator.wikimedia.org/T127113#2368115 (10Andrew) 05Open>03Resolved I believe we've now updated everything to use the latest version of qemu, a modern version from the cloud archive. [15:31:32] aude: Yes there is something. I've checked on https://www.ssllabs.com/ssltest/analyze.html?d=clinton4.nara.gov&s=2620%3a0%3a2b0%3a10f1%3a0%3a0%3a0%3a109 this is the same issue than explained at https://phabricator.wikimedia.org/P3001#13606 [15:31:43] !log added topic override retention.bytes=536870912000 to Kafka webrequest_text (T136690) [15:31:44] T136690: Kafka 0.9's partitions rebalance causes data log mtime reset messing up with time based log retention - https://phabricator.wikimedia.org/T136690 [15:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:03] aude: They need to concatenate Entrust Certification Authority - L1K + their certificate. [15:35:32] I'm reporting that on the task, and warn they should currently use http:// for server side upload. [15:35:39] Thanks for the deploy aude. [15:36:19] (03CR) 10Florianschmidtwelzow: [C: 031] Enable AuthManager on group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293491 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [15:38:47] (03PS2) 10Thcipriani: WIP: tilerator to scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/293518 [15:42:27] (03PS3) 10Thcipriani: WIP: tilerator to scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/293518 [15:46:22] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2365971 (10hashar) [15:46:26] 06Operations, 06Labs, 10Labs-Infrastructure, 06Release-Engineering-Team, and 3 others: Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2368164 (10hashar) 05Open>03stalled From a quick chat with @mark we dont want... 
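(A quick way to confirm the incomplete-chain diagnosis above from the command line, and what the server-side fix looks like; file names are placeholders:)

    # show exactly which certificates the server sends; an incomplete chain lists
    # only the leaf, and verification fails with the same "local issuer" error curl gives
    openssl s_client -connect clinton4.nara.gov:443 -servername clinton4.nara.gov -showcerts </dev/null
    # fix on the server: publish the leaf followed by the Entrust L1K intermediate
    cat clinton4.nara.gov.crt entrust-l1k-intermediate.crt > clinton4.nara.gov.chained.crt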
[15:49:55] (03PS4) 10Thcipriani: WIP: tilerator to scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/293518 [15:52:49] 06Operations, 10Traffic, 13Patch-For-Review: Scripts depending on varnishlog.py maxing out CPU usage on cache_misc - https://phabricator.wikimedia.org/T137114#2368210 (10ema) It looks like the culprit might be lack of grouping at the varnishlog level. The following experiment is currently ongoing on cp1061:... [15:53:52] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2368213 (10hashar) Following {T137323}, @mark stated that there should be no traffic between the private network and labs... [15:57:51] (03Abandoned) 10Ema: varnishapi.py: reset error message [puppet] - 10https://gerrit.wikimedia.org/r/293132 (owner: 10Ema) [15:58:11] RECOVERY - Apache HTTP on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.063 second response time [15:58:23] (03PS3) 10Elukey: Limit the maximum Kafka topic partition size to 500GB. [puppet] - 10https://gerrit.wikimedia.org/r/293270 (https://phabricator.wikimedia.org/T136690) [15:59:22] (03CR) 10Ottomata: [C: 031] Limit the maximum Kafka topic partition size to 500GB. [puppet] - 10https://gerrit.wikimedia.org/r/293270 (https://phabricator.wikimedia.org/T136690) (owner: 10Elukey) [16:00:04] godog and moritzm: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160609T1600). [16:00:04] mobrovac: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:07] (03PS2) 10Ema: varnishlog4.py: log errors in execute() [puppet] - 10https://gerrit.wikimedia.org/r/293123 (https://phabricator.wikimedia.org/T137114) [16:00:31] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2368243 (10Gehel) We shoudl be able to also process at least: * vendor/jquery.uls/css/jquery.uls.css * logo.svg They are unlikely to change frequently but still it would b... [16:03:35] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: analytics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273#2368252 (10Milimetric) a:03elukey [16:03:41] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: Smartctl disk defects on kafka1012 - https://phabricator.wikimedia.org/T136933#2368253 (10Milimetric) a:03elukey [16:04:31] anybody doing puppetswat? [16:04:39] what do you need? [16:05:11] https://gerrit.wikimedia.org/r/#/c/292573/ ? [16:05:21] a noop - https://gerrit.wikimedia.org/r/#/c/292573/ [16:05:22] yes [16:05:52] (03PS2) 10Faidon Liambotis: Change Prop: Use the URIs for MW and RB from service::configuration [puppet] - 10https://gerrit.wikimedia.org/r/292573 (owner: 10Mobrovac) [16:05:59] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Change Prop: Use the URIs for MW and RB from service::configuration [puppet] - 10https://gerrit.wikimedia.org/r/292573 (owner: 10Mobrovac) [16:07:30] RECOVERY - puppet last run on mw1266 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:08:37] thnx paravoid! [16:09:01] RECOVERY - Apache HTTP on mw1266 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.196 second response time [16:17:31] (03PS4) 10Elukey: Limit the maximum Kafka topic partition size to 500GB. 
[puppet] - 10https://gerrit.wikimedia.org/r/293270 (https://phabricator.wikimedia.org/T136690) [16:18:41] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [16:20:01] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [16:20:10] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [16:22:11] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [16:23:10] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:11] barium ^^^ looking [16:24:11] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [16:25:10] RECOVERY - check_puppetrun on barium is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [16:25:11] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 9.792 second response time [16:27:40] (03CR) 10Elukey: [C: 032] Limit the maximum Kafka topic partition size to 500GB. [puppet] - 10https://gerrit.wikimedia.org/r/293270 (https://phabricator.wikimedia.org/T136690) (owner: 10Elukey) [16:32:31] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293521 (https://phabricator.wikimedia.org/T137433) (owner: 10Ladsgroup) [16:35:38] (03PS2) 10Ladsgroup: ores: Move CORS to uwsgi from nginx [puppet] - 10https://gerrit.wikimedia.org/r/293521 (https://phabricator.wikimedia.org/T137433) [16:36:36] (03CR) 10BBlack: [C: 031] varnishlog4.py: log errors in execute() [puppet] - 10https://gerrit.wikimedia.org/r/293123 (https://phabricator.wikimedia.org/T137114) (owner: 10Ema) [16:37:20] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [16:37:34] (03PS5) 10Elukey: Add the analytics contact group to hadoop/kafka related nrpe monitors. [puppet] - 10https://gerrit.wikimedia.org/r/293114 (https://phabricator.wikimedia.org/T125128) [16:37:40] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2368340 (10Jonas) >>! In T137238#2368243, @Gehel wrote: > We shoudl be able to also process at least: > * vendor/jquery.uls/css/jquery.uls.css This is fixed. > * main pag... [16:37:52] (03PS3) 10Alexandros Kosiaris: ores: Move CORS to uwsgi from nginx [puppet] - 10https://gerrit.wikimedia.org/r/293521 (https://phabricator.wikimedia.org/T137433) (owner: 10Ladsgroup) [16:37:58] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] ores: Move CORS to uwsgi from nginx [puppet] - 10https://gerrit.wikimedia.org/r/293521 (https://phabricator.wikimedia.org/T137433) (owner: 10Ladsgroup) [16:39:14] (03PS6) 10Elukey: Add the analytics contact group to hadoop/kafka related nrpe monitors. [puppet] - 10https://gerrit.wikimedia.org/r/293114 (https://phabricator.wikimedia.org/T125128) [16:44:15] 06Operations, 10ops-eqiad, 10media-storage: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2368393 (10RobH) [16:44:52] (03CR) 10Elukey: [C: 032] Add the analytics contact group to hadoop/kafka related nrpe monitors. 
[puppet] - 10https://gerrit.wikimedia.org/r/293114 (https://phabricator.wikimedia.org/T125128) (owner: 10Elukey) [16:45:37] (03PS2) 10Hoo man: Enable ArticlePlaceholder extension in guwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291589 (https://phabricator.wikimedia.org/T136517) (owner: 10KartikMistry) [16:45:54] 06Operations, 10procurement: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T137439#2368426 (10RobH) [16:46:25] 06Operations, 10ops-codfw, 10media-storage: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T137439#2368431 (10RobH) [16:46:51] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:47:51] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.003 second response time [16:50:10] (03PS1) 10Hoo man: Enable the ArticlePlaceholder on lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293525 (https://phabricator.wikimedia.org/T136100) [16:50:12] (03PS1) 10Hoo man: Enable the ArticlePlaceholder on nnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293526 (https://phabricator.wikimedia.org/T130997) [16:52:10] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 7.903 second response time [16:52:42] 06Operations, 10ops-codfw, 10media-storage: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T137439#2368449 (10RobH) a:05RobH>03fgiunchedi So we'll need @fgiunchedi to offer input on this, as it is nearly identical to T136631. With the 6 new swift backends, we need to know if they wi... [16:53:13] 06Operations, 10ops-eqiad, 10media-storage: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2341913 (10RobH) a:03fgiunchedi Filippo is back shortly, so assigning to him for input. Please provide feedback and assign back to me for followup, thank you! [16:53:16] (03PS1) 10Elukey: Add new appservers to the related DSH list. [puppet] - 10https://gerrit.wikimedia.org/r/293527 [16:58:31] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2368472 (10Smalyshev) > * main page (index.html) -> no cache (or very short) I would cache it for the same as below. It doesn't change that much - pretty much once a week n... [16:58:53] (03PS2) 10Catrope: Enable Flow beta feature on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292198 (https://phabricator.wikimedia.org/T136684) [16:59:10] (03PS2) 10Catrope: Remove HiddenPrefs hack for turning off cross-wiki notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288702 (https://phabricator.wikimedia.org/T135266) [16:59:44] (03PS5) 10Thcipriani: Deploy Tilerator with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/293518 [17:00:04] yurik, gwicke, cscott, arlolra, and subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160609T1700). Please do the needful. 
[17:00:13] i'll skip [17:00:58] (03CR) 10Alexandros Kosiaris: [C: 032] lttoolbox: New upstream version [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/293484 (https://phabricator.wikimedia.org/T124137) (owner: 10KartikMistry) [17:03:20] (03PS6) 10Thcipriani: Deploy Tilerator with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/293518 [17:04:40] (03PS1) 10Ema: varnishlog4: default to request grouping [puppet] - 10https://gerrit.wikimedia.org/r/293530 (https://phabricator.wikimedia.org/T137114) [17:04:41] PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: Puppet has 1 failures [17:06:01] (03CR) 10Thcipriani: "Puppet compiler is finally happy and aware that there is a change: https://puppet-compiler.wmflabs.org/3078/" [puppet] - 10https://gerrit.wikimedia.org/r/293518 (owner: 10Thcipriani) [17:10:22] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:12] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.021 second response time [17:14:41] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [17:15:11] --^ just re-enabled it [17:16:42] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:16:46] good [17:18:12] (03CR) 10RobH: [C: 031] Add new appservers to the related DSH list. [puppet] - 10https://gerrit.wikimedia.org/r/293527 (owner: 10Elukey) [17:18:32] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:38] (03CR) 10Elukey: [C: 032] Add new appservers to the related DSH list. [puppet] - 10https://gerrit.wikimedia.org/r/293527 (owner: 10Elukey) [17:19:50] !log change-prop deploying ecfda93f09d [17:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:20:02] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [17:20:22] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.014 second response time [17:29:35] (03CR) 10Alexandros Kosiaris: [C: 031] Deploy Tilerator with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/293518 (owner: 10Thcipriani) [17:31:12] RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [17:32:20] (03CR) 10Mobrovac: [C: 04-1] Deploy Tilerator with Scap3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293518 (owner: 10Thcipriani) [17:39:01] (03CR) 10Alexandros Kosiaris: [C: 032] cg3: New upstream release [debs/contenttranslation/cg3] - 10https://gerrit.wikimedia.org/r/293485 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [17:41:31] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.920 second response time [17:41:43] (03PS2) 10Muehlenhoff: Define backup for contint [puppet] - 10https://gerrit.wikimedia.org/r/293294 (https://phabricator.wikimedia.org/T80385) [17:41:49] (03PS7) 10Thcipriani: Deploy Tilerator with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/293518 [17:43:02] !log Restarting Cassandra on xenon.eqiad.wmnet (use exponentially decaying resevoirs for metrics histograms) : T126629 [17:43:02] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [17:43:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:43:41] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 24.910 second response time [17:46:51] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [17:58:52] RECOVERY - mediawiki-installation DSH group on mw1264 is OK: OK [18:00:04] hoo and frimelle: Dear anthropoid, the time has come. Please deploy ArticlePlaceholder (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160609T1800). [18:01:11] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:02:19] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 8.535 second response time [18:08:28] PROBLEM - Disk space on ms-be2012 is CRITICAL: DISK CRITICAL - free space: / 2124 MB (3% inode=96%) [18:09:39] RECOVERY - mediawiki-installation DSH group on mw1263 is OK: OK [18:11:39] RECOVERY - mediawiki-installation DSH group on mw1266 is OK: OK [18:14:58] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.762 second response time [18:16:59] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 7.536 second response time [18:19:08] PROBLEM - test icmp reachability to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 82 probes of 394 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [18:22:29] (03Abandoned) 10EBernhardson: Make three of the newer ES nodes master eligable [puppet] - 10https://gerrit.wikimedia.org/r/251024 (https://phabricator.wikimedia.org/T112556) (owner: 10EBernhardson) [18:22:46] (03Abandoned) 10EBernhardson: Remove old ES nodes from master capable list [puppet] - 10https://gerrit.wikimedia.org/r/251025 (https://phabricator.wikimedia.org/T112556) (owner: 10EBernhardson) [18:23:11] (03PS1) 10MaxSem: Move everything Postgres-related out of role::maps::server [puppet] - 10https://gerrit.wikimedia.org/r/293540 [18:23:17] gehel, ^ [18:23:43] (03CR) 10EBernhardson: "ping for merge?" [puppet] - 10https://gerrit.wikimedia.org/r/283604 (owner: 10EBernhardson) [18:23:48] (03PS4) 10EBernhardson: Make mwrepl a little more user friendly [puppet] - 10https://gerrit.wikimedia.org/r/283604 [18:28:00] (03CR) 10Mobrovac: [C: 031] Deploy Tilerator with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/293518 (owner: 10Thcipriani) [18:29:18] MaxSem: thanks! Busy right now, I'll have a look as soon as I can... [18:33:18] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:33:57] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, 13Patch-For-Review: Only use newer (elastic10{16..31}) servers as master capable elasticsearch nodes - https://phabricator.wikimedia.org/T112556#2368912 (10EBernhardson) 05Open>03declined These nodes are being removed from the clu... 
[18:34:08] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.811 second response time [18:34:28] (03CR) 10Jforrester: "\o/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293491 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [18:34:28] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.016 second response time [18:36:09] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 15.324 second response time [18:55:05] I just spent an hour waiting for various Jenkins jobs… fun way to spend time :P [18:56:12] hoo jenkins seems to be taking along time. Since gallium failed yesturday. [18:57:10] Don't think it's slower than usual [18:58:32] ly [18:59:22] * bawolff thinks its kind of a bit slow [18:59:51] hm… maybe [19:00:05] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160609T1900). [19:00:39] … and there goes my deployment slot [19:00:42] holding train for: https://phabricator.wikimedia.org/T137404 [19:01:39] thcipriani: Do you think/ not think that's Wikidata related? [19:01:59] … also, can I continue with my deploy now? [19:02:49] hoo: unsure about the cause. Feel free to continue your deploy, looks like there's some work to be done before the train rolls yet. [19:03:21] Probably going to look into the bug after [19:03:34] quite likely that is wikidata related [19:03:54] (we mess with interwiki links on all levels, including in the Skin) [19:05:48] thanks :) [19:19:11] o_O [19:25:17] RECOVERY - test icmp reachability to codfw on ripe-atlas-codfw is OK: OK - failed 7 probes of 394 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [19:26:49] (03CR) 10Kaldari: [C: 031] Revert "Revert "Set Tamil projects to use uca-ta collation"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290686 (https://phabricator.wikimedia.org/T75453) (owner: 10Dereckson) [19:29:50] (03CR) 10Dereckson: "Dependent change has been merged." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290686 (https://phabricator.wikimedia.org/T75453) (owner: 10Dereckson) [19:34:59] (03Abandoned) 10Dzahn: contint: limit access to zuul-merger git daemon [puppet] - 10https://gerrit.wikimedia.org/r/293449 (https://phabricator.wikimedia.org/T137323) (owner: 10Dzahn) [19:40:57] thcipriani: looks like the mw wiki trains is on hold. Mind if i do a quick mobileapps deploy now? [19:41:37] bearND: it is on hold. mobileapps deploy should be fine. [19:41:52] thcipriani: thanks [19:42:05] !log starting mobileapps deploy [19:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:44:18] !log mobileapps deployed 71ff97c [19:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:44:38] thcipriani: done. Thanks! //cc: mdholloway mobrovac [19:44:59] bearND: ack. Thanks. 
[19:47:54] !log hoo@tin Synchronized php-1.28.0-wmf.5/extensions/Wikidata: Update ArticlePlaceholder (duration: 02m 04s) [19:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:51:25] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [19:53:26] https://phabricator.wikimedia.org/T116404 bit us [19:53:31] !log hoo@tin Synchronized php-1.28.0-wmf.5/extensions/Wikidata: revert, possible s5 master overload (duration: 01m 57s) [19:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:53:41] Forget the s5 master thing [19:53:42] https://phabricator.wikimedia.org/T116404 [19:53:47] crap :( [19:55:35] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:55:36] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:58:14] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:58:19] Should not be related to my changes (anymore) [19:58:22] reverted them [19:59:15] !log Restarting Cassandra on xenon.eqiad.wmnet (removing patched test build; restoring state) : T137474 [19:59:16] T137474: Investigate lack of recency bias in Cassandra histogram metrics - https://phabricator.wikimedia.org/T137474 [19:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:00:24] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:02:05] 06Operations, 10DBA, 10Wikidata, 07Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#2369349 (10hoo) p:05Low>03High We just hit this hard when we changed the query traffic patterns towards... [20:02:11] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:03:10] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:05:57] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: Grant graphite.wikimedia.org rights to grafana-admin LDAP group - https://phabricator.wikimedia.org/T137373#2369353 (10Dzahn) @JanZerebecki is right. Ladsgroup already has existing production access / deployer and i see he signed L2 in th... [20:06:30] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: Grant graphite.wikimedia.org rights to grafana-admin LDAP group - https://phabricator.wikimedia.org/T137373#2369354 (10Dzahn) @Ladsgroup you should now be able to login on graphite (and icinga). 
[20:07:06] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: add Ladsgroup to nda LDAP group (was: Grant graphite.wikimedia.org rights to grafana-admin LDAP group) - https://phabricator.wikimedia.org/T137373#2369355 (10Dzahn) [20:07:16] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: add Ladsgroup to nda LDAP group (was: Grant graphite.wikimedia.org rights to grafana-admin LDAP group) - https://phabricator.wikimedia.org/T137373#2366461 (10Dzahn) 05Open>03Resolved a:03Dzahn [20:11:15] @seen Ladsgroup [20:11:16] mutante: I have never seen Ladsgroup [20:11:53] 06Operations, 10DBA, 10Wikidata, 07Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#1748737 (10hoo) [20:16:21] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [20:16:37] (03PS1) 10Dzahn: add lint:ignore's for remaining files outside modules [puppet] - 10https://gerrit.wikimedia.org/r/293569 [20:16:50] mutante: it's Amir1 in IRC if that's helpful :) [20:17:26] mutante: Amir1 is on a flight / gtting to an airport I think [20:17:30] thcipriani: thanks, yes it is [20:17:50] alright, just wanted to let him know about graphite access [20:20:56] 06Operations, 10DBA, 10Wikidata, 07Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#2369385 (10hoo) db1070 vs. db1068 (different database, cold queries, the fact that the result rows match is... [20:25:17] (03CR) 10Gehel: Move everything Postgres-related out of role::maps::server (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/293540 (owner: 10MaxSem) [20:29:12] (03CR) 10Gehel: "This makes a lot of sense and adds clarity! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/293540 (owner: 10MaxSem) [20:29:41] (03CR) 1020after4: [C: 031] "I suggested some more rewrite rules on T127224 but this is a good start." 
[puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [20:32:57] (03PS5) 10Paladox: git.wikimedia.org -> Diffusion redirects [puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [20:33:03] !log aaron@tin Synchronized php-1.28.0-wmf.5/includes/user/User.php: c3b1f80a701d61dc57ccac0c8b1dc7daf03fa925 (duration: 00m 29s) [20:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:34:31] (03PS2) 10MaxSem: Move everything Postgres-related out of role::maps::server [puppet] - 10https://gerrit.wikimedia.org/r/293540 [20:36:04] !log hoo@tin Synchronized php-1.28.0-wmf.5/extensions/Wikidata: Update ArticlePlaceholder (without unrelated T136598 fixes this time) (duration: 01m 51s) [20:36:05] T136598: Wikidata master database connection issue - https://phabricator.wikimedia.org/T136598 [20:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:36:41] Looks good [20:36:48] no obvious fallout this time [20:40:20] !log hoo@tin Synchronized php-1.28.0-wmf.4/extensions/Wikidata: Update ArticlePlaceholder (duration: 01m 54s) [20:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:42:19] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:44:10] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 5.410 second response time [20:45:48] (03CR) 10Hoo man: [C: 032] Enable ArticlePlaceholder extension in guwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291589 (https://phabricator.wikimedia.org/T136517) (owner: 10KartikMistry) [20:46:22] (03Merged) 10jenkins-bot: Enable ArticlePlaceholder extension in guwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291589 (https://phabricator.wikimedia.org/T136517) (owner: 10KartikMistry) [20:48:28] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Enable the ArticlePlaceholder on guwiki (T136517) (duration: 00m 24s) [20:48:29] T136517: Enable ArticlePlaceholder extension in guwiki - https://phabricator.wikimedia.org/T136517 [20:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:49:25] Looks good on guwiki [20:50:49] (03CR) 10Hoo man: [C: 032] Enable the ArticlePlaceholder on lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293525 (https://phabricator.wikimedia.org/T136100) (owner: 10Hoo man) [20:50:59] (03CR) 10Smalyshev: [C: 031] Don't publish etags for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/293492 (https://phabricator.wikimedia.org/T137238) (owner: 10Gehel) [20:51:02] (03CR) 10BBlack: [C: 031] varnishlog4: default to request grouping [puppet] - 10https://gerrit.wikimedia.org/r/293530 (https://phabricator.wikimedia.org/T137114) (owner: 10Ema) [20:51:22] (03Merged) 10jenkins-bot: Enable the ArticlePlaceholder on lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293525 (https://phabricator.wikimedia.org/T136100) (owner: 10Hoo man) [20:51:33] (03CR) 10BBlack: [C: 031] Don't publish etags for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/293492 (https://phabricator.wikimedia.org/T137238) (owner: 10Gehel) [20:53:48] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Enable the ArticlePlaceholder on lvwiki (T136100) (duration: 00m 26s) [20:53:49] T136100: Enable ArticlePlaceholder on lvwiki - https://phabricator.wikimedia.org/T136100 
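(The guwiki ArticlePlaceholder switch above is the standard config-only deploy: merge the mediawiki-config change in Gerrit, pull it onto the deployment host, and sync the single changed file. A sketch, assuming the usual staging path and scap subcommand of that period:)

    cd /srv/mediawiki-staging          # checkout of operations/mediawiki-config
    git pull
    scap sync-file wmf-config/InitialiseSettings.php 'Enable the ArticlePlaceholder on guwiki (T136517)'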
[20:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:55:49] (03CR) 10Hoo man: [C: 032] Enable the ArticlePlaceholder on nnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293526 (https://phabricator.wikimedia.org/T130997) (owner: 10Hoo man) [20:56:24] (03Merged) 10jenkins-bot: Enable the ArticlePlaceholder on nnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293526 (https://phabricator.wikimedia.org/T130997) (owner: 10Hoo man) [20:57:11] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2365971 (10demon) 05Open>03declined Per what I said in T133300#2369730. [20:57:13] 06Operations, 10Traffic, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Move gallium to an internal host? - https://phabricator.wikimedia.org/T133150#2369740 (10demon) [20:57:17] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Enable the ArticlePlaceholder on nnwiki (T130997) (duration: 00m 24s) [20:57:18] T130997: [Task] Configure ArticlePlaceholder for nnwiki - https://phabricator.wikimedia.org/T130997 [20:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:58:09] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [20:59:00] Ok, all verified ok now [20:59:14] Think I'm done with ArticlePlaceholder deploys for today [20:59:20] not even 2 hours late [21:00:10] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.029 second response time [21:19:31] (03PS1) 10Andrew Bogott: Makedomain: append a '.' on the requested domain if needed. [puppet] - 10https://gerrit.wikimedia.org/r/293622 [21:20:39] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:21:01] (03CR) 10BryanDavis: [C: 031] Makedomain: append a '.' on the requested domain if needed. [puppet] - 10https://gerrit.wikimedia.org/r/293622 (owner: 10Andrew Bogott) [21:22:45] (03CR) 10Andrew Bogott: [C: 032] Makedomain: append a '.' on the requested domain if needed. [puppet] - 10https://gerrit.wikimedia.org/r/293622 (owner: 10Andrew Bogott) [21:24:39] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 8.149 second response time [21:29:38] 06Operations, 10DBA, 06Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2027631 (10Blahma) Just noticed this bug just before opening a new bug report on cswiki_p missing virtually all revisions and categorylinks from between 2016-03-08 18:00 and 21:00 UTC, which spoils the results... [21:29:43] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2369922 (10hashar) From talk we had, contint1001 was setup in emergency since gallium could have been unrecoverable. Turn... 
[21:40:49] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.336 second response time [21:41:05] !log aaron@tin Synchronized php-1.28.0-wmf.5/includes: 904dd4ae088a8f67942c09b2b28178377955d6a6 (duration: 01m 18s) [21:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:41:29] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [21:44:49] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.011 second response time [21:45:44] (03Abandoned) 10Hashar: cache_misc: change doc/integration.wm.o backend [puppet] - 10https://gerrit.wikimedia.org/r/293284 (https://phabricator.wikimedia.org/T137358) (owner: 10Hashar) [21:54:38] (03CR) 10Catrope: Enable Flow beta feature on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292198 (https://phabricator.wikimedia.org/T136684) (owner: 10Catrope) [21:59:54] (03CR) 10Thcipriani: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [22:00:04] tgr: Respected human, time to deploy AuthManager (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160609T2200). Please do the needful. [22:01:13] (03CR) 10jenkins-bot: [V: 04-1] Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [22:01:53] (03CR) 10Thcipriani: [C: 031] "Seems to work well, should be able to integrate this in deployments without issue." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [22:02:15] tgr: train is still blocked, FYI. [22:02:43] thcipriani: any guess when it will end? [22:02:44] thcipriani: https://gerrit.wikimedia.org/r/#/c/293627/ (if hoo is not around) [22:02:58] aude: ooh, hadn't seen that yet. [22:02:59] Looking already [22:03:10] just figured it out and has nothign to do with content translation [22:03:26] i don't know if caches will be broken for an hour [22:03:45] or if we can force expire sites cache [22:05:14] aude: Ok, now it's obvious why it's broken [22:05:44] +2ed [22:06:03] anyone doing SWAT deploys? [22:06:03] :/ [22:06:07] thanks [22:06:18] i was wasting time looking at content translation [22:08:44] kaldari: not at the moment, will likely be deploying https://gerrit.wikimedia.org/r/#/c/293627/ backport and rolling forward wikiversions here in...however long jenkins takes. [22:08:51] 15 minutes? [22:09:30] thcipriani: https://phabricator.wikimedia.org/T133911 [22:09:40] would speed up jenkins ^^ [22:11:27] aude: could you double check me on the backport? https://gerrit.wikimedia.org/r/#/c/293631/ [22:13:31] also need to figure out how to sync it without breaking anything :P [22:14:16] looking [22:14:52] sync-dir should be ok [22:15:23] or sync the DBSiteStore first [22:15:30] errr [22:15:33] otherway [22:15:58] sync the ServiceWiring so it's not calling the method in DBSiteStore [22:16:15] then DBsiteStore [22:18:12] kk [22:18:31] we don't use the file site store in production [22:19:37] gotcha, ok, that seems simple enough, thanks! [22:19:53] if the backport looks good, I'll go ahead and +2, fetch down, and roll forward. [22:20:02] ok [22:20:07] well, fetch down and sync then roll forward. 
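(aude's ordering advice above is the key point: the revert touches both a caller and a callee, so the caller has to go out first to avoid a window in which some appserver runs a ServiceWiring.php that still calls a method no longer present in DBSiteStore.php. A sketch, assuming the usual fetch-then-sync flow on the deployment host; the git steps for pulling the 293631 backport into the branch checkout are glossed over:)

    cd /srv/mediawiki-staging
    # Part I: first stop ServiceWiring.php from referencing the reverted code
    scap sync-file php-1.28.0-wmf.5/includes/ServiceWiring.php 'Revert "Map dummy language codes in sites" Part I'
    # Part II: now it is safe to sync the callee
    scap sync-file php-1.28.0-wmf.5/includes/site/DBSiteStore.php 'Revert "Map dummy language codes in sites" Part II'

This is the two-step sequence thcipriani runs at 22:35 below.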
[22:20:21] we may have a fix for ARticlePlaceholder after [22:20:27] but not a blocker [22:20:56] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [22:20:58] k, sounds good. [22:26:04] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:28:04] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 7.749 second response time [22:35:01] !log thcipriani@tin Synchronized php-1.28.0-wmf.5/includes/ServiceWiring.php: [[gerrit:293631|Revert "Map dummy language codes in sites"]] Part I (duration: 00m 23s) [22:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:35:38] !log thcipriani@tin Synchronized php-1.28.0-wmf.5/includes/site/DBSiteStore.php: [[gerrit:293631|Revert "Map dummy language codes in sites"]] Part II (duration: 00m 31s) [22:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:36:09] 06Operations, 06Performance-Team, 06Services, 07Availability: Consider restbase/cassandra for session storage (with SSL) - https://phabricator.wikimedia.org/T134811#2370126 (10aaron) [22:36:24] ^ aude sync'd! [22:36:33] (03Abandoned) 10Aaron Schulz: Made the session/main stashes write to both DCs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247325 (https://phabricator.wikimedia.org/T111575) (owner: 10Aaron Schulz) [22:37:08] aude: anything to check in group1, or are we ok to roll forward? [22:39:38] thcipriani: OK, let me know if you decide to go ahead with the SWAT stuff later (looks like it's only config changes) [22:41:05] kaldari: I don't normally run evening SWAT, it'll likely happen in 20 minutes (I hope) [22:44:59] (03PS1) 10Thcipriani: all wikis to 1.28.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293634 [22:45:01] (03PS6) 10Dzahn: git.wikimedia.org -> Diffusion redirects [puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) [22:45:15] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [22:46:21] 06Operations, 06Performance-Team, 06Services, 07Availability: Consider restbase/cassandra for session storage (with SSL) - https://phabricator.wikimedia.org/T134811#2370144 (10aaron) [22:46:29] (03CR) 10Thcipriani: [C: 032] all wikis to 1.28.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293634 (owner: 10Thcipriani) [22:47:04] (03Merged) 10jenkins-bot: all wikis to 1.28.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293634 (owner: 10Thcipriani) [22:47:16] alright, rolling forward [22:48:12] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.5 [22:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:48:53] tgr: sorry I took up most of your window, but we're all on wmf.5 now. [22:49:23] thx thcipriani [22:49:33] thcipriani: thanks [22:50:10] aude: thank you for the patch! 
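(Rolling the train forward as above only touches the version map: merge the mediawiki-config change that points all wikis at the new branch, pull it, and let scap rebuild and distribute wikiversions. A sketch, assuming sync-wikiversions behaves as it did at the time:)

    cd /srv/mediawiki-staging
    git pull                                     # picks up 'all wikis to 1.28.0-wmf.5'
    scap sync-wikiversions 'all wikis to 1.28.0-wmf.5'
    # scap rebuilds wikiversions.php from the committed version map and pushes it everywhere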
[22:50:15] PROBLEM - Apache HTTP on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50421 bytes in 0.016 second response time [22:56:25] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:57:53] (03PS2) 10Gergő Tisza: Add tags for group1 and group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293502 (owner: 10Anomie) [23:00:05] RoanKattouw, ostriches, Krenair, MaxSem, awight, and Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160609T2300). Please do the needful. [23:00:05] Dereckson, Eranroz, Dereckson, and matt_flaschen: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:28] Present [23:00:31] Hello, I can SWAT this evening, but after the previous deployment window is done. [23:00:34] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 1.629 second response time [23:01:50] tgr: thcipriani: could I get a ping when you're finished with train or AuthManager? [23:01:55] Dereckson: I'm available also, if you need a backup. [23:02:04] i think train is done, assuming no problems [23:02:09] :) [23:02:18] Dereckson: train is finished. AuthManager may still be in progress... [23:02:48] * aude got my train deployment sms alert :) [23:03:04] Dereckson: will do (jenkins has two more backports to merge so maybe 10-20 min?) [23:03:30] tgr: perfect, happy Zuul/Jenkins waiting time [23:05:21] (03PS3) 10Dereckson: Remove HiddenPrefs hack for turning off cross-wiki notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288702 (https://phabricator.wikimedia.org/T135266) (owner: 10Catrope) [23:07:03] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [23:07:41] Dereckson: i might have a swat patch for wikidata, if i can prepare it quick enough [23:08:12] aude: okay :) [23:08:26] or longer of the qunit tests continue acting up :( [23:18:32] !log tgr@tin Synchronized php-1.28.0-wmf.5/extensions/ConfirmEdit/FancyCaptcha/resources/ext.confirmEdit.fancyCaptcha.js: deploying [[gerrit:293637]] for AuthManager T135504 (duration: 00m 24s) [23:18:33] T135504: Enable AuthManager in WMF production - https://phabricator.wikimedia.org/T135504 [23:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:18:53] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[23:19:55] !log tgr@tin Synchronized php-1.28.0-wmf.5/extensions/MobileFrontend/resources/skins.minerva.special.userlogin.styles/userlogin.less: deploying [[gerrit:293638]] for AuthManager T135504 (duration: 00m 25s) [23:19:56] T135504: Enable AuthManager in WMF production - https://phabricator.wikimedia.org/T135504 [23:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:20:52] !log tgr@tin Synchronized php-1.28.0-wmf.5/includes/specialpage/LoginSignupSpecialPage.php: deploying [[gerrit:293636]] for AuthManager T135504 (duration: 00m 25s) [23:20:53] T135504: Enable AuthManager in WMF production - https://phabricator.wikimedia.org/T135504 [23:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:21:25] (03CR) 10Gergő Tisza: [C: 032] Add tags for group1 and group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293502 (owner: 10Anomie) [23:22:06] (03Merged) 10jenkins-bot: Add tags for group1 and group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293502 (owner: 10Anomie) [23:22:17] (03PS2) 10Gergő Tisza: Enable AuthManager on group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293491 (https://phabricator.wikimedia.org/T135504) [23:23:42] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.002 second response time [23:25:50] (03CR) 10Gergő Tisza: [C: 032] Enable AuthManager on group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293491 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [23:26:33] (03Merged) 10jenkins-bot: Enable AuthManager on group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293491 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [23:28:15] !log tgr@tin Synchronized dblists/group2.dblist: add dblist for group2 (duration: 00m 22s) [23:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:29:34] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.019 second response time [23:29:50] !log tgr@tin Synchronized wmf-config/CommonSettings.php: enable use of group1, group2 dblists in config (duration: 00m 23s) [23:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:31:05] !log tgr@tin Synchronized wmf-config/InitialiseSettings.php: enable AuthManager on group2 wikis T135504 (duration: 00m 24s) [23:31:06] T135504: Enable AuthManager in WMF production - https://phabricator.wikimedia.org/T135504 [23:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:31:24] anomie: dapatrick: ^ [23:31:32] I see it! [23:33:11] Dereckson: done [23:33:13] Hmm. Login failed first try, redirected me to Login form again with no message. Tried logging in again and it worked. [23:33:35] dapatrick: you need to investigate that? [23:33:38] dapatrick: did you open the login form before the deployment? [23:33:51] tgr Nope. [23:34:01] Cleared cookies and was sitting at the main page. [23:34:37] I opened a private session, tried to login on fr.wikip, worked like a charm. [23:34:37] I'm not freaking out about it. [23:35:03] seems to work for me, can you figure out how to reproduce? [23:36:46] Things are working fine now, after that initial weirdness. 
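(The group2 rollout above shows the dblist pattern: a dblist is just a plain-text file listing one wiki database per line, and once CommonSettings.php knows about it, the list name can be used as a key in InitialiseSettings.php so a single setting line enables AuthManager for every wiki in the list. A sketch of what such a file looks like; the entries are illustrative, not the actual contents of dblists/group2.dblist:)

    # dblists/group2.dblist -- one database name per line
    cat > dblists/group2.dblist <<'EOF'
    enwiki
    dewiki
    commonswiki
    EOF

The three syncs at 23:28-23:31 above land in exactly that order: the dblist itself, the CommonSettings.php change that registers it, and finally the InitialiseSettings.php change keyed on it.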
I can hit all sites as expected (used repro steps from T136989) [23:36:46] T136989: Enabling two-factor authentication disrupts SUL behavior - https://phabricator.wikimedia.org/T136989 [23:36:57] Trying to reproduce my initial problem now. [23:37:00] okay let's go matt_flaschen, we'll give back the window to tgr and dapatrick if they need a hotfix [23:37:08] (03PS4) 10Dereckson: Remove HiddenPrefs hack for turning off cross-wiki notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288702 (https://phabricator.wikimedia.org/T135266) (owner: 10Catrope) [23:37:11] Okay [23:37:17] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288702 (https://phabricator.wikimedia.org/T135266) (owner: 10Catrope) [23:37:55] (03Merged) 10jenkins-bot: Remove HiddenPrefs hack for turning off cross-wiki notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288702 (https://phabricator.wikimedia.org/T135266) (owner: 10Catrope) [23:38:24] matt_flaschen: live on mw1017 [23:38:30] Dereckson, tgr, seems fine. Thanks! [23:40:08] Dereckson, it doesn't affect testwiki, so I can't test there. [23:40:46] matt_flaschen: you can test where you want with https://wikitech.wikimedia.org/wiki/Debugging_in_production [23:41:24] if you use this extension - https://chrome.google.com/webstore/detail/wikimediadebug/binmakecefompkjggiklgjenddjoifbb - you can with one click send request to this server [23:42:21] Dereckson, thanks, I forgot about that. [23:42:44] there is also an extension for Firefox: https://addons.mozilla.org/en-US/firefox/addon/wikimedia-debug-header/ [23:46:10] Dereckson: added wikidata patch to the wiki [23:46:18] it will apply to wmf.5 core [23:47:02] Dereckson, looks good. [23:47:11] aude, matt_flaschen, okay [23:47:45] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Remove HiddenPrefs hack for turning off cross-wiki notifications (T135266) (duration: 00m 27s) [23:47:46] T135266: Gate cross-wiki preferences entirely (default off) - https://phabricator.wikimedia.org/T135266 [23:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:48:36] (03PS2) 10Dzahn: Remove Nik Everett's production access [puppet] - 10https://gerrit.wikimedia.org/r/291125 (https://phabricator.wikimedia.org/T130113) (owner: 10Greg Grossmeier) [23:48:50] (03PS3) 10Dereckson: Enable Flow beta feature on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292198 (https://phabricator.wikimedia.org/T136684) (owner: 10Catrope) [23:48:59] (03CR) 10Dereckson: [C: 032] Enable Flow beta feature on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292198 (https://phabricator.wikimedia.org/T136684) (owner: 10Catrope) [23:49:03] PROBLEM - puppet last run on cp2026 is CRITICAL: CRITICAL: puppet fail [23:49:54] (03PS3) 10Dzahn: Remove Nik Everett's production access [puppet] - 10https://gerrit.wikimedia.org/r/291125 (https://phabricator.wikimedia.org/T130113) (owner: 10Greg Grossmeier) [23:50:19] (03Merged) 10jenkins-bot: Enable Flow beta feature on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292198 (https://phabricator.wikimedia.org/T136684) (owner: 10Catrope) [23:50:45] (03CR) 10Dzahn: [C: 032] Remove Nik Everett's production access [puppet] - 10https://gerrit.wikimedia.org/r/291125 (https://phabricator.wikimedia.org/T130113) (owner: 10Greg Grossmeier) [23:51:26] matt_flaschen: live on mw1017 [23:52:47] Dereckson, looks good. 
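(Dereckson's "Debugging in production" pointer above works because the browser extensions do nothing more than add a request header that routes the request to the debug appserver, mw1017 at the time. The header name is real; the value syntax below is an assumption for the 2016-era setup:)

    # send one request through the debug backend and inspect the response headers
    curl -sI -H 'X-Wikimedia-Debug: 1' 'https://fr.wikipedia.org/wiki/Special:BlankPage'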
[23:53:22] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable Flow beta feature on frwiki (T136684) (duration: 00m 27s) [23:53:23] T136684: Deploy Flow as a Beta Feature on French Wikipedia - https://phabricator.wikimedia.org/T136684 [23:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:54:05] I'm skipping Set import sources for he.wikipedia as there is a -1 and no news from Eranroz since. [23:54:26] (03PS1) 10Dzahn: admin: move manybubbles to absented users [puppet] - 10https://gerrit.wikimedia.org/r/293656 (https://phabricator.wikimedia.org/T130113) [23:54:36] (03PS2) 10Dereckson: Revert "Revert "Set Tamil projects to use uca-ta collation"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290686 (https://phabricator.wikimedia.org/T75453) [23:54:44] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290686 (https://phabricator.wikimedia.org/T75453) (owner: 10Dereckson) [23:55:32] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [23:55:47] (03PS2) 10Dzahn: admin: move manybubbles to absented users [puppet] - 10https://gerrit.wikimedia.org/r/293656 (https://phabricator.wikimedia.org/T130113) [23:55:53] (03Merged) 10jenkins-bot: Revert "Revert "Set Tamil projects to use uca-ta collation"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290686 (https://phabricator.wikimedia.org/T75453) (owner: 10Dereckson) [23:57:00] works on mw1017 [23:57:31] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Set Tamil projects to use uca-ta collation II (T75453) (duration: 00m 25s) [23:57:31] T75453: Tamil sort order - https://phabricator.wikimedia.org/T75453 [23:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:59:12] aude: you self-merged the fix [23:59:32] only on the branch [23:59:45] oh okay