[00:00:49] jdlrobson: by the way, what about php-1.28.0-wmf.3? [00:01:28] nuria_, is the file mentioned in https://phabricator.wikimedia.org/T136120 supposed to be getting updated every time we make a new wiki? [00:03:37] James_F: could you abort https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/44271/ too please? [00:04:25] mlitn: mediawiki-extensions-qunit tests are failing at this moment [00:05:20] they’re known to intermittently fail, but it can’t be related to this change (which is in some maintenance script, which is not run from anywhere) [00:05:40] a simple “recheck” should do, unless that delays the process even more :) [00:05:42] Yes, no problem, once test are aborted I'll V+2 [00:06:00] cool :) [00:07:42] (03PS1) 10Yuvipanda: Add dh-python to build dependencies [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/290611 [00:07:45] (03PS1) 10Yuvipanda: Add options to webservice-runner to control proxy registration [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/290612 [00:07:46] (03PS1) 10Yuvipanda: Debian version bump [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/290613 [00:08:12] check_command => 'check_http_url!commons.wikimedia.org/wiki/Main_Page!Picture of the day', [00:08:33] ^ check_command in puppet monitoring::service, which then turns it into icinga config [00:08:55] (how) do i need to escape the whitespace? grr [00:11:46] mlitn: would you have a Jenkins access? If so, could you abort https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/44271/ [00:13:22] aborted [00:13:24] Thanks. 
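For context on the check_command question above: monitoring::service is the Puppet wrapper that renders the Icinga service definition, and check_http_url takes !-separated arguments. A sketch of what the resource under discussion might look like (the resource title and description are invented for illustration; whether the spaces in "Picture of the day" need escaping depends on how the generated command definition quotes its $ARG$ macros, which this log doesn't show):

```puppet
# Hypothetical resource; the check_command value is copied from the log above.
# In Icinga, '!' separates arguments, so 'Picture of the day' arrives as a
# single argument -- escaping would only matter if the underlying command
# definition re-splits that argument on whitespace.
monitoring::service { 'commons_main_page':
    description   => 'Main Page content on commons',
    check_command => 'check_http_url!commons.wikimedia.org/wiki/Main_Page!Picture of the day',
}
```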
[00:16:11] !log dereckson@tin Synchronized /srv/mediawiki-staging/php-1.28.0-wmf.2/extensions/Flow/maintenance/FlowRemoveOldTopics.php: Don't assume workflows/revisions are inserted in chronological order (T119509) (duration: 00m 28s) [00:16:12] T119509: Cleanup ptwikibooks conversion - https://phabricator.wikimedia.org/T119509 [00:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:16:58] mlitn: please test on wmf.2 ^ [00:18:12] ok; this’ll take a minute [00:18:36] Fine. If working, I'll manually merge wmf3 then. [00:18:39] (03PS2) 10Dzahn: add icinga monitoring for content on commons [puppet] - 10https://gerrit.wikimedia.org/r/290606 (https://phabricator.wikimedia.org/T124812) [00:18:58] or it could be merged by Zuul according https://integration.wikimedia.org/zuul/ [00:19:31] Dereckson: seems to work, thanks! [00:19:40] You're welcome, thanks for testing. [00:19:54] so for wmf3, we wait https://integration.wikimedia.org/ci/job/mediawiki-extensions-php55/4162/console [00:20:22] Dereckson: won’t be able to test wmf.3, though (just need it once wmf.3 rolls out in a couple of days) [00:20:28] 00:20:12 < grrrit-wm> (Merged) jenkins-bot: Don't assume workflows/revisions are inserted in chronological order [extensions/Flow] (wmf/1.28.0-wmf.3) - https://gerrit.wikimedia.org/r/290518 (https://phabricator.wikimedia.org/T119509) (owner: Matthias Mullie) [00:20:32] okay [00:21:34] (03PS3) 10Dzahn: add icinga monitoring for content on commons [puppet] - 10https://gerrit.wikimedia.org/r/290606 (https://phabricator.wikimedia.org/T124812) [00:22:14] !log dereckson@tin Synchronized /srv/mediawiki-staging/php-1.28.0-wmf.3/extensions/Flow/maintenance/FlowRemoveOldTopics.php: Don't assume workflows/revisions are inserted in chronological order (T119509) (duration: 00m 23s) [00:22:15] T119509: Cleanup ptwikibooks conversion - https://phabricator.wikimedia.org/T119509 [00:22:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:23:15] (03PS4) 10Dzahn: add icinga monitoring for content on commons [puppet] - 10https://gerrit.wikimedia.org/r/290606 (https://phabricator.wikimedia.org/T124812) [00:24:05] (03PS5) 10Dzahn: add icinga monitoring for content on commons [puppet] - 10https://gerrit.wikimedia.org/r/290606 (https://phabricator.wikimedia.org/T124812) [00:24:27] (03CR) 10Dzahn: [C: 032] add icinga monitoring for content on commons [puppet] - 10https://gerrit.wikimedia.org/r/290606 (https://phabricator.wikimedia.org/T124812) (owner: 10Dzahn) [00:28:02] MatmaRex: Jenkins failures on https://gerrit.wikimedia.org/r/#/c/290581/, how do you feel about merge it manually? [00:30:54] Dereckson: ugh, it's because https://gerrit.wikimedia.org/r/#/c/290603/ failed [00:30:57] James_F: ^ [00:32:20] how did that take an hour to run [00:32:31] Dereckson: i'm fixing it [00:34:15] Jenkins task to abort: https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/44297/ [00:34:55] Dereckson: https://gerrit.wikimedia.org/r/#/c/290603/ should merge now, and once it merges, https://gerrit.wikimedia.org/r/#/c/290581/ should also pass [00:35:18] k [00:35:22] foks: ^ [00:35:47] foks: so at 03:00 UTC we've an automated job to run l10nupdate, it will pick your l10n change if merged [00:36:32] OK [00:36:43] (ping Jamesofur as well) [00:37:42] 290581? 
[00:37:45] thanks [00:37:46] aye [00:37:50] OK cool [00:37:58] I've created it temporarily but once it 'should' be out will delete and see if it works :D [00:37:59] lmk if there's anything I need to do [00:38:15] (or will just make foks do it if he's still awake) [00:44:50] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:45:13] that's the one that had the scap pull [00:45:19] an hour ago [00:45:30] mw1140 in SAL [00:46:49] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.042 second response time [00:49:13] if something new pops up about content on commons, that's my new check for T124812 worked manually on neon [00:49:13] T124812: add icinga and watchmouse https checks for content on commons. or other wikimedia.org sites - https://phabricator.wikimedia.org/T124812 [00:52:27] 06Operations, 10Incident-20160126-WikimediaDomainRedirection, 10Monitoring, 13Patch-For-Review: add icinga and watchmouse https checks for content on commons. or other wikimedia.org sites - https://phabricator.wikimedia.org/T124812#2325165 (10Dzahn) https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?t... [00:53:16] mutante: yeah, mw1140 failed sooner this evening, so some of the SWAT deployments wasn't propagated to mw1140. When it was responsive again, I've scap pull to sync it. [00:53:53] Dereckson: gotcha! should we take it out of rotation for now? [00:54:18] it seems ok now [00:54:30] and it's API [00:54:55] (initial failure icinga warnings at 23:15 and 23:21 UTC, recovery at 23:42 UTC) [00:55:17] the socket timeout can also just be on icinga side [00:55:25] when neon is too busy [00:55:42] scap was unable to ssh to it [00:55:45] ok [00:56:21] ah! i see it [00:56:29] 926 May 24 23:19:00 mw1140 nrpe[4593]: INFO: SSL Socket Shutdown. 927 May 24 23:19:38 mw1140 nrpe[4601]: Error: Could not complete SSL handshake. 
[00:56:46] if nrpe doesnt work all the other icinga checks will fail [00:58:26] 907 May 24 23:14:16 mw1140 nrpe[3548]: Could not read request from client, bailing out... [00:58:50] puppet-agent[2748]: Could not get latest version: Cannot allocate memory - fork(2) [00:59:12] you know what it is .. it's the hhvm memory leak and the other things are effects of running out of memory.. afaict [00:59:24] restarting hhvm service would fix [00:59:34] i did it on some other servers too [01:00:29] k [01:00:51] (03CR) 10Thcipriani: [C: 031] "Nice! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/290490 (owner: 10Mobrovac) [01:14:03] 06Operations, 10Incident-20160126-WikimediaDomainRedirection, 10Monitoring, 13Patch-For-Review: add icinga and watchmouse https checks for content on commons. or other wikimedia.org sites - https://phabricator.wikimedia.org/T124812#2325182 (10Dzahn) So the Icinga part is there now. What i don't know is:... [01:15:08] PROBLEM - HHVM rendering on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:15:39] foks: Jamesofur: https://gerrit.wikimedia.org/r/#/c/290581/ has finally been successfully merged [01:15:53] \o/ [01:15:56] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:15:59] \o/ [01:16:00] was that my fault? :( [01:16:05] Not at all [01:16:14] We had multiple failures on Jenkins this night. [01:16:49] ahh I see [01:16:54] As the patches are merged in a queue, one failure blocks the queue. [01:16:56] PROBLEM - HHVM processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:17:12] * foks nods [01:17:37] PROBLEM - RAID on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:17:56] PROBLEM - SSH on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:17:57] PROBLEM - configured eth on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:18:07] PROBLEM - nutcracker port on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:16] PROBLEM - Check size of conntrack table on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:27] PROBLEM - nutcracker process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:27] PROBLEM - DPKG on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:28] PROBLEM - Disk space on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:46] PROBLEM - salt-minion processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:56] RECOVERY - HHVM processes on mw1140 is OK: PROCS OK: 6 processes with command name hhvm [01:19:33] (03PS4) 10Dzahn: ircecho: add systemd require/after to start after ircd [puppet] - 10https://gerrit.wikimedia.org/r/290588 (https://phabricator.wikimedia.org/T134875) [01:20:27] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1218.31 seconds [01:20:41] [l10nupdate] Submodule path 'WikimediaMessages': checked out 'f6baa894bbac7d468dd9dae1e9a1f9be7ab88dc8' [01:21:29] So yes I confirm the procedure is: 1. merge to master 2. run `l10nupdate` on Tin (or wait 3am UTC) [01:21:46] PROBLEM - dhclient process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:22:58] By the way, wikitech docs says 2am, not 3am [01:24:56] PROBLEM - HHVM processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:30:26] (03CR) 10Dzahn: [C: 032] ircecho: add systemd require/after to start after ircd [puppet] - 10https://gerrit.wikimedia.org/r/290588 (https://phabricator.wikimedia.org/T134875) (owner: 10Dzahn) [01:37:41] Krinkle: ok, nagf is now back to being almost like a regular tool. You can just 'become nagf' and do stuff in public_html and it'll reflect in the webservice now. 
[01:37:48] Krenair: you also got a php upgrade along the way [01:40:24] hehe [01:40:48] (03CR) 10Dzahn: "May 25 01:34:15 udpmx-01 systemd[1]: Starting IRCd for Mediawiki RecentChanges feed..." [puppet] - 10https://gerrit.wikimedia.org/r/290588 (https://phabricator.wikimedia.org/T134875) (owner: 10Dzahn) [01:41:08] yurik: nice tab complete ;) [01:41:22] legoktm, wa? [01:41:39] bah [01:41:42] legoktm: haha :P [01:42:01] 06Operations, 10Wikimedia-IRC-RC-Server, 13Patch-For-Review: udpmxircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875#2325218 (10Dzahn) so this solved it in so far that after a reboot the next puppet run will start both the IRC server and the ircbot and th... [01:42:09] yurik: sorry, just messing with YuviPanda :P [01:42:29] * yurik files a phab ticket for YuviPanda to do something about tab complete [01:42:43] (03PS2) 10Yuvipanda: Debian version bump [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/290613 [01:42:45] (03PS2) 10Yuvipanda: Add options to webservice-runner to control proxy registration [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/290612 [01:42:45] :P [01:42:47] legoktm: lol [01:42:47] RECOVERY - salt-minion processes on mw1140 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:42:47] RECOVERY - HHVM processes on mw1140 is OK: PROCS OK: 6 processes with command name hhvm [01:42:47] (03PS1) 10Yuvipanda: Use only simple stat cache for lighttpd [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/290618 [01:43:01] YuviPanda: Thanks, I'll check it out tomorrow. [01:43:09] 06Operations, 10Wikimedia-IRC-RC-Server, 13Patch-For-Review: udpmxircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875#2325219 (10Dzahn) puppet just starts the ircecho service and then both are up: Notice: /Stage[main]/Mw_rc_irc::Irc_echo/Service[ircecho]/e... 
[01:43:19] YuviPanda: does 'like a regular tool' include k8s? [01:43:22] Krinkle: np. this will eventually become --backend=k8s parameter to webservice sometime in the next few weeks [01:43:29] Krinkle: yeah, it is running in k8s [01:43:37] Krinkle: but deployment is like a regular tool - just off NFS [01:43:38] RECOVERY - dhclient process on mw1140 is OK: PROCS OK: 0 processes with command name dhclient [01:43:38] RECOVERY - RAID on mw1140 is OK: OK: no RAID installed [01:43:48] RECOVERY - configured eth on mw1140 is OK: OK - interfaces up [01:43:48] RECOVERY - SSH on mw1140 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [01:44:06] Krinkle: so the container starts, mounts NFS then runs off of it. logs are back in access.log and error.log as well [01:44:07] RECOVERY - nutcracker port on mw1140 is OK: TCP OK - 0.000 second response time on port 11212 [01:44:16] RECOVERY - Check size of conntrack table on mw1140 is OK: OK: nf_conntrack is 0 % full [01:44:17] RECOVERY - nutcracker process on mw1140 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [01:44:24] Krinkle: it's the exact same code that runs in the gridengine, except running inside k8s [01:44:26] RECOVERY - DPKG on mw1140 is OK: All packages OK [01:44:27] RECOVERY - Disk space on mw1140 is OK: DISK OK [01:44:36] (03CR) 10Yuvipanda: [C: 032] Add options to webservice-runner to control proxy registration [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/290612 (owner: 10Yuvipanda) [01:44:47] YuviPanda: Hm.. interesting. So what's different right now between nagf and my other php-based plain tool labs tools? [01:44:58] Krinkle: your other php-based tools are running on gridengine [01:45:05] Krinkle: while nagf doesn't touch gridengine at all [01:45:24] YuviPanda: right, it's registered somewhere special? [01:45:33] Krinkle: yep, there's a k8s manifest [01:45:33] service.manifest is empty [01:45:39] Krinkle: right. 
[01:45:56] Krinkle: so I did this one by hand, and I'll be adding a k8s backend to 'webservice' over this / next week [01:46:05] cool, no worries [01:46:05] Thanks [01:46:27] (03CR) 10Yuvipanda: [C: 032] Use only simple stat cache for lighttpd [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/290618 (owner: 10Yuvipanda) [01:46:39] (03CR) 10Yuvipanda: [C: 032] Debian version bump [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/290613 (owner: 10Yuvipanda) [01:48:29] !log dereckson@tin scap sync-l10n completed (1.28.0-wmf.2) (duration: 15m 24s) [01:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:48:48] foks: you can probably delete the strings on meta now [01:49:17] PROBLEM - HHVM rendering on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:50:18] PROBLEM - Apache HTTP on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:50:23] will try :3 [01:51:07] RECOVERY - HHVM rendering on mw1117 is OK: HTTP OK: HTTP/1.1 200 OK - 67562 bytes in 0.109 second response time [01:52:07] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.033 second response time [01:52:17] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.30 seconds [01:52:55] foks: https://commons.wikimedia.org/wiki/MediaWiki:Group-wmf-supportsafety does't seem to be there [01:53:28] Bleh. Maybe needs time? [01:53:36] PROBLEM - RAID on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:53:46] PROBLEM - dhclient process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:53:56] PROBLEM - SSH on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:53:56] PROBLEM - configured eth on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:54:16] PROBLEM - nutcracker port on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:54:17] PROBLEM - Check size of conntrack table on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:54:27] PROBLEM - DPKG on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:54:27] PROBLEM - nutcracker process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:54:37] PROBLEM - Disk space on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:54:47] PROBLEM - salt-minion processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:54:48] PROBLEM - HHVM processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:55:59] mutante: mw1140 is failing again ^ [01:56:11] (03PS1) 10Dzahn: ircecho: set user/group/mode for unit file [puppet] - 10https://gerrit.wikimedia.org/r/290619 [01:59:26] (03PS1) 10Dereckson: Use scap subcommands [puppet] - 10https://gerrit.wikimedia.org/r/290620 (https://phabricator.wikimedia.org/T136157) [02:00:06] !log reboot unresponse mw1140 [02:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:01:40] (03PS2) 10Dereckson: l10nupdate: use scap subcommands [puppet] - 10https://gerrit.wikimedia.org/r/290620 (https://phabricator.wikimedia.org/T136157) [02:02:17] RECOVERY - Last backup of the tools filesystem on labstore1001 is OK: OK - Last run for unit replicate-tools was successful [02:02:28] RECOVERY - nutcracker process on mw1140 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:02:36] RECOVERY - DPKG on mw1140 is OK: All packages OK [02:02:38] RECOVERY - Disk space on mw1140 is OK: DISK OK [02:02:48] RECOVERY - salt-minion processes on mw1140 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:02:56] RECOVERY - HHVM processes on mw1140 is OK: PROCS OK: 6 processes with command name hhvm [02:03:37] RECOVERY - RAID on mw1140 is OK: OK: no RAID installed [02:03:48] RECOVERY - dhclient process on mw1140 is OK: PROCS OK: 0 
processes with command name dhclient [02:03:57] RECOVERY - SSH on mw1140 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [02:03:57] RECOVERY - configured eth on mw1140 is OK: OK - interfaces up [02:03:58] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.828 second response time [02:04:17] RECOVERY - nutcracker port on mw1140 is OK: TCP OK - 0.000 second response time on port 11212 [02:04:18] RECOVERY - Check size of conntrack table on mw1140 is OK: OK: nf_conntrack is 7 % full [02:05:07] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 67564 bytes in 1.696 second response time [02:06:46] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [02:09:30] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 03Scap3 (Scap3-MediaWiki-MVP): Completely port l10nupdate to scap - https://phabricator.wikimedia.org/T133913#2325246 (10Dereckson) [02:12:18] (03CR) 10jenkins-bot: [V: 04-1] ircecho: set user/group/mode for unit file [puppet] - 10https://gerrit.wikimedia.org/r/290619 (owner: 10Dzahn) [02:12:56] _now_ you say that jenkins :) [02:23:20] (03PS13) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) [02:24:20] (03CR) 10jenkins-bot: [V: 04-1] install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [02:24:56] PROBLEM - HHVM rendering on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:25:47] PROBLEM - Apache HTTP on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:06] PROBLEM - Apache HTTP on mw1117 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50408 bytes in 2.459 second response time [02:26:37] PROBLEM - HHVM rendering on mw1116 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [02:26:47] RECOVERY - HHVM rendering on mw1115 is OK: HTTP OK: HTTP/1.1 200 OK - 67561 bytes in 0.098 second response time [02:26:57] PROBLEM - HHVM rendering on mw1117 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.004 second response time [02:27:06] PROBLEM - configured eth on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:27:13] (03PS14) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) [02:27:37] PROBLEM - nutcracker port on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:27:46] PROBLEM - puppet last run on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:28:06] PROBLEM - RAID on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:28:08] PROBLEM - Check size of conntrack table on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:28:16] PROBLEM - Disk space on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:28:17] PROBLEM - DPKG on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:28:46] PROBLEM - HHVM processes on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:28:46] PROBLEM - nutcracker process on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:28:56] PROBLEM - SSH on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:26] RECOVERY - nutcracker port on mw1116 is OK: TCP OK - 0.000 second response time on port 11212 [02:29:37] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 37 minutes ago with 0 failures [02:29:56] RECOVERY - RAID on mw1116 is OK: OK: no RAID installed [02:30:06] RECOVERY - Check size of conntrack table on mw1116 is OK: OK: nf_conntrack is 0 % full [02:30:07] RECOVERY - Disk space on mw1116 is OK: DISK OK [02:30:16] RECOVERY - DPKG on mw1116 is OK: All packages OK [02:30:37] RECOVERY - HHVM processes on mw1116 is OK: PROCS OK: 6 processes with command name hhvm [02:30:37] RECOVERY - nutcracker process on mw1116 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:30:47] RECOVERY - SSH on mw1116 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [02:30:59] RECOVERY - configured eth on mw1116 is OK: OK - interfaces up [02:31:58] (03CR) 10BryanDavis: "See also I6fe9218097ec91dc4f32ad3b75a7ad2fd9ad9d16" [puppet] - 10https://gerrit.wikimedia.org/r/290620 (https://phabricator.wikimedia.org/T136157) (owner: 10Dereckson) [02:34:54] (03PS15) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) [02:35:47] PROBLEM - puppet last run on mw1116 is CRITICAL: CRITICAL: Puppet has 15 failures [02:36:22] (03CR) 10jenkins-bot: [V: 04-1] install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [02:41:36] PROBLEM - configured eth on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:41:47] PROBLEM - HHVM processes on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:41:58] PROBLEM - SSH on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:42:07] PROBLEM - DPKG on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:42:17] PROBLEM - nutcracker port on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:42:26] PROBLEM - dhclient process on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:42:56] PROBLEM - puppet last run on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:42:56] PROBLEM - RAID on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:44:28] PROBLEM - nutcracker process on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:44:28] PROBLEM - salt-minion processes on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:44:46] PROBLEM - Check size of conntrack table on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:44:46] PROBLEM - Apache HTTP on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:44:56] PROBLEM - Disk space on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:45:06] PROBLEM - HHVM rendering on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:46:56] PROBLEM - nutcracker process on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:47:06] PROBLEM - Check size of conntrack table on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:47:26] PROBLEM - puppet last run on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:47:27] PROBLEM - HHVM processes on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:47:46] PROBLEM - configured eth on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:47:57] PROBLEM - RAID on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:48:07] PROBLEM - DPKG on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:48:26] PROBLEM - SSH on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:48:36] PROBLEM - nutcracker port on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:51:38] PROBLEM - salt-minion processes on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:51:58] PROBLEM - dhclient process on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:53:07] PROBLEM - Disk space on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:37] RECOVERY - SSH on mw1115 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:00:37] RECOVERY - nutcracker port on mw1115 is OK: TCP OK - 0.000 second response time on port 11212 [03:01:28] RECOVERY - Last backup of the others filesystem on labstore1001 is OK: OK - Last run for unit replicate-others was successful [03:08:37] PROBLEM - SSH on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:09:16] PROBLEM - nutcracker port on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:09:25] RECOVERY - salt-minion processes on mw1117 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:09:45] RECOVERY - SSH on mw1117 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:09:46] RECOVERY - dhclient process on mw1117 is OK: PROCS OK: 0 processes with command name dhclient [03:11:25] RECOVERY - nutcracker process on mw1115 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [03:11:36] RECOVERY - Disk space on mw1115 is OK: DISK OK [03:13:05] RECOVERY - Disk space on mw1117 is OK: DISK OK [03:15:05] RECOVERY - nutcracker port on mw1115 is OK: TCP OK - 0.000 second response time on port 11212 [03:15:16] PROBLEM - salt-minion processes on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:15:36] PROBLEM - SSH on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:15:46] PROBLEM - dhclient process on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:18:56] PROBLEM - Disk space on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:19:16] PROBLEM - nutcracker process on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:19:36] PROBLEM - Disk space on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:20:36] RECOVERY - dhclient process on mw1115 is OK: PROCS OK: 0 processes with command name dhclient [03:21:26] RECOVERY - Disk space on mw1115 is OK: DISK OK [03:22:05] RECOVERY - salt-minion processes on mw1115 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:26:35] PROBLEM - dhclient process on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:28:05] PROBLEM - salt-minion processes on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:28:46] PROBLEM - nutcracker port on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:33:36] PROBLEM - Disk space on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:44:25] PROBLEM - HHVM processes on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:44:25] PROBLEM - nutcracker port on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:44:25] PROBLEM - Check size of conntrack table on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:52:55] PROBLEM - salt-minion processes on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:54:26] RECOVERY - dhclient process on mw1115 is OK: PROCS OK: 0 processes with command name dhclient [04:00:26] PROBLEM - dhclient process on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:07:55] PROBLEM - NTP on mw1115 is CRITICAL: NTP CRITICAL: No response from NTP server [04:11:17] RECOVERY - dhclient process on mw1117 is OK: PROCS OK: 0 processes with command name dhclient [04:14:36] RECOVERY - nutcracker process on mw1115 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [04:14:36] RECOVERY - DPKG on mw1115 is OK: All packages OK [04:15:35] RECOVERY - RAID on mw1115 is OK: OK: no RAID installed [04:15:36] RECOVERY - NTP on mw1115 is OK: NTP OK: Offset -0.009279131889 secs [04:15:36] RECOVERY - salt-minion processes on mw1115 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:15:55] RECOVERY - configured eth on mw1115 is OK: OK - interfaces up [04:15:56] RECOVERY - SSH on mw1115 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [04:16:07] RECOVERY - dhclient process on mw1115 is OK: PROCS OK: 0 processes with command name dhclient [04:16:16] RECOVERY - nutcracker port on mw1115 is OK: TCP OK - 0.000 second response time on port 11212 [04:16:16] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 4.916 second response time [04:16:26] RECOVERY - Check size of conntrack table on mw1115 is OK: OK: nf_conntrack is 0 % full [04:16:57] RECOVERY - Disk space on mw1115 is OK: DISK OK [04:17:16] PROBLEM - dhclient process on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:17:16] RECOVERY - HHVM processes on mw1115 is OK: PROCS OK: 6 processes with command name hhvm [04:17:17] RECOVERY - HHVM rendering on mw1115 is OK: HTTP OK: HTTP/1.1 200 OK - 67547 bytes in 0.346 second response time [04:17:46] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 200, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave] [04:18:15] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave] [04:18:56] RECOVERY - puppet last run on mw1115 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [04:40:30] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [04:40:30] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 202, down: 0, dormant: 0, excluded: 0, unused: 0 [04:49:51] PROBLEM - NTP on mw1117 is CRITICAL: NTP CRITICAL: No response from NTP server [04:57:20] (03PS2) 10Dzahn: ircecho: set user/group/mode for unit file [puppet] - 10https://gerrit.wikimedia.org/r/290619 [05:01:40] (03PS16) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) [05:05:06] (03PS17) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) [05:05:55] (03PS18) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) [05:30:26] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1
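The repeated mw1115/mw1117/mw1140 alerts above follow a regular `[HH:MM:SS] PROBLEM/RECOVERY - <check> on <host> is ...` shape, so flapping hosts can be tallied mechanically. A minimal sketch (the line format is taken from this log; the regex is an assumption about its regularity, not an official icinga-wm format):

```python
import re
from collections import Counter

# Matches alert lines like:
#   [01:15:08] PROBLEM - HHVM rendering on mw1140 is CRITICAL: ...
#   [02:02:38] RECOVERY - Disk space on mw1140 is OK: DISK OK
ALERT_RE = re.compile(
    r'\[(\d{2}:\d{2}:\d{2})\] (PROBLEM|RECOVERY) - (.+?) on (\S+) is'
)

def tally_alerts(log_text):
    """Count PROBLEM/RECOVERY events per host to spot flapping servers."""
    counts = Counter()
    for match in ALERT_RE.finditer(log_text):
        _time, state, _check, host = match.groups()
        counts[(host, state)] += 1
    return counts

sample = (
    "[01:15:08] PROBLEM - HHVM rendering on mw1140 is CRITICAL: timeout "
    "[01:15:56] PROBLEM - Apache HTTP on mw1140 is CRITICAL: timeout "
    "[01:52:17] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: lag 0.30"
)
print(tally_alerts(sample))
```

Fed a full day of this channel, lopsided (host, 'PROBLEM') counts would point at hosts like mw1140 that eventually needed the reboot logged at 02:00.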
databases (db0) with 6501199 keys - replication_delay is 0 [05:34:26] RECOVERY - NTP on mw1117 is OK: NTP OK: Offset -0.1170092821 secs [05:49:35] RECOVERY - salt-minion processes on mw1117 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [05:49:35] RECOVERY - HHVM processes on mw1117 is OK: PROCS OK: 5 processes with command name hhvm [05:49:55] RECOVERY - Check size of conntrack table on mw1117 is OK: OK: nf_conntrack is 0 % full [05:50:05] RECOVERY - RAID on mw1117 is OK: OK: no RAID installed [05:50:15] RECOVERY - nutcracker port on mw1117 is OK: TCP OK - 0.000 second response time on port 11212 [05:50:24] RECOVERY - SSH on mw1117 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [05:50:35] RECOVERY - dhclient process on mw1117 is OK: PROCS OK: 0 processes with command name dhclient [05:50:46] RECOVERY - DPKG on mw1117 is OK: All packages OK [05:50:55] RECOVERY - nutcracker process on mw1117 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:51:05] RECOVERY - Disk space on mw1117 is OK: DISK OK [05:51:05] RECOVERY - configured eth on mw1117 is OK: OK - interfaces up [05:53:06] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:02:35] (03Abandoned) 10Gehel: WIP experiments, just keeping that safe somewhere... [puppet] - 10https://gerrit.wikimedia.org/r/288691 (owner: 10Gehel) [06:30:35] 06Operations, 07HHVM, 13Patch-For-Review, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2325428 (10tstarling) >>! In T86096#965842, @Reedy wrote: > Yup, updateCollation.php does need running when we've upgraded :) > > I think we need to run it e... 
[06:31:04] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:06] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:14] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:22] (03CR) 10Muehlenhoff: "The rules for deployment are added in the dependent patch (290421)" [puppet] - 10https://gerrit.wikimedia.org/r/290422 (owner: 10Muehlenhoff) [06:31:45] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:51] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:29] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:30] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures [06:35:40] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: puppet fail [06:43:00] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:47:11] (03PS2) 10Muehlenhoff: Enable base::firewall for new snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/290422 [06:49:54] (03PS5) 10Mobrovac: Change prop: Add the rule for MobileApps re-renders [puppet] - 10https://gerrit.wikimedia.org/r/286847 [06:56:41] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:56:50] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:57:39] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:57:40] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:58:30] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:40] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is 
currently enabled, last run 1 minute ago with 0 failures [06:59:10] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [07:03:00] (03CR) 10jenkins-bot: [V: 04-1] Enable base::firewall for new snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/290422 (owner: 10Muehlenhoff) [07:10:41] (03PS2) 10Muehlenhoff: Enable base::firewall in the pool counter role [puppet] - 10https://gerrit.wikimedia.org/r/290405 [07:28:38] PROBLEM - Disk space on fluorine is CRITICAL: DISK CRITICAL - free space: /a 135957 MB (3% inode=99%) [07:30:32] 150 GB runjobs [07:30:36] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall in the pool counter role [puppet] - 10https://gerrit.wikimedia.org/r/290405 (owner: 10Muehlenhoff) [07:30:46] 3TB archive [07:31:30] on archive it is mostly api and runjobs [07:32:29] can I delete older files from both? [07:33:21] e.g. < 201604 [07:37:32] lots of INFO, no WARNINGs or ERRORs there [07:38:56] <_joe_> jynus: for runjobs, yes [07:39:15] (03CR) 10Alexandros Kosiaris: [C: 032] Change prop: Add the rule for MobileApps re-renders [puppet] - 10https://gerrit.wikimedia.org/r/286847 (owner: 10Mobrovac) [07:39:21] (03PS6) 10Alexandros Kosiaris: Change prop: Add the rule for MobileApps re-renders [puppet] - 10https://gerrit.wikimedia.org/r/286847 (owner: 10Mobrovac) [07:39:37] (03CR) 10Alexandros Kosiaris: [V: 032] Change prop: Add the rule for MobileApps re-renders [puppet] - 10https://gerrit.wikimedia.org/r/286847 (owner: 10Mobrovac) [07:41:48] !log rm runJobs.log-20160[1-3]*.gz on fluorine archive log [07:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:42:37] RECOVERY - Disk space on fluorine is OK: DISK OK [07:42:38] 81%, that should be enough [07:43:06] 64k should be enough for everyone [07:55:29] jynus: hi. We need to run a bunch of queries on stat1002, do you know any documentation on how-to do it?
[07:55:44] jynus: they are mostly for wikishared DB [07:56:03] (I have access to stat1002, now. Thanks for that!) [07:56:23] let me see, I am not a user of that [07:56:26] Queries run fine on terbium, and need to port to stat1002 [07:58:26] it should be something like "mysql -h analytics-slave", I think, but let me see how credentials are handled [07:59:14] OK. Let me check with analytics team. [07:59:27] kart_, https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Data_sources [07:59:31] it is documented there [08:03:42] 06Operations, 10ops-eqiad, 10DBA: db1056 BBU Failed - https://phabricator.wikimedia.org/T136136#2325624 (10jcrespo) These updates, by primary key, took up to 14 second on its master, which is crazy and makes no sense (row locking?): ``` UPDATE /* LinksUpdate::updateLinksTimestamp 127.0.0.1 */ `page` SET page... [08:08:00] !log installing php5 security updates on trusty systems [08:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:09:37] jynus: thanks [08:10:05] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 643 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6513560 keys - replication_delay is 643 [08:12:52] 06Operations, 10ops-eqiad, 10DBA: db1056 BBU Failed - https://phabricator.wikimedia.org/T136136#2325632 (10Volans) @jcrespo: On it's master (db1042) the battery is ok but there is a disk in predictive failure, although the policy is still WriteBack. I'm not sure it could explain this given that all writes go... [08:14:25] 06Operations, 06Discovery, 10Maps, 03Discovery-Maps-Sprint, 13Patch-For-Review: Install / configure new maps servers in codfw - https://phabricator.wikimedia.org/T134901#2325633 (10Gehel) replication-osm is now running (and taking time catching up). postgresql access has been fixed, initial data import c... 
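jynus's fluorine cleanup above (rm runJobs.log-20160[1-3]*.gz, i.e. everything before 201604) amounts to selecting rotated logs by the date stamp in their names. A minimal dry-run sketch of that selection; the filenames below are illustrative, not actual fluorine contents:

```python
import re

# Rotated logs are named like runJobs.log-20160315.gz; compare the YYYYMM
# part of the stamp against a cutoff month (201604 in the log above).
LOG_RE = re.compile(r"runJobs\.log-(\d{8})\.gz$")

def logs_older_than(filenames, cutoff_yyyymm):
    """Return rotated logs whose YYYYMM stamp sorts before the cutoff."""
    return [name for name in filenames
            if (m := LOG_RE.search(name)) and m.group(1)[:6] < cutoff_yyyymm]

files = ["runJobs.log-20160115.gz", "runJobs.log-20160331.gz",
         "runJobs.log-20160410.gz"]
print(logs_older_than(files, "201604"))
# → ['runJobs.log-20160115.gz', 'runJobs.log-20160331.gz']
```

Printing the candidate list before any rm is the point of the dry run; the actual deletion stays a separate, deliberate step.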
[08:14:46] PROBLEM - puppet last run on mw2151 is CRITICAL: CRITICAL: Puppet has 1 failures [08:15:15] PROBLEM - puppet last run on mw2089 is CRITICAL: CRITICAL: Puppet has 1 failures [08:17:45] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [08:18:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [08:19:57] PROBLEM - puppet last run on mw1024 is CRITICAL: CRITICAL: Puppet has 1 failures [08:20:00] 06Operations, 10ops-eqiad, 10DBA: db1056 BBU Failed - https://phabricator.wikimedia.org/T136136#2325642 (10jcrespo) It is probably the lock: ``` 30 15 9 294 db1042 wikiuser commonswiki UPDATE /* LinksUpdate::updateLinksTimestamp 127.0.0.1 */ `page` SET page_links_updated = '20160525072346' WHERE page_id = '4... [08:21:06] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [08:22:28] 06Operations, 10ops-eqiad, 10DBA: db1056 BBU Failed - https://phabricator.wikimedia.org/T136136#2325644 (10Volans) Yes, from few random checks they are at the same time for the same ID (select get lock and update) [08:22:51] pageviews, it seems [08:24:26] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:25:06] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:25:36] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:27:06] !log restarting apache on uranium [08:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:27:56] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 700 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6503921 keys - replication_delay is 700 [08:33:31] (03CR) 10Filippo Giunchedi: 
"recheck" [puppet] - 10https://gerrit.wikimedia.org/r/290455 (owner: 10Filippo Giunchedi) [08:33:39] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [08:34:57] (03PS1) 10Elukey: Add support for Spark Dynamic Resource allocation. [puppet/cdh] - 10https://gerrit.wikimedia.org/r/290643 (https://phabricator.wikimedia.org/T101343) [08:36:25] (03PS2) 10Elukey: Add support for Spark Dynamic Resource allocation. [puppet/cdh] - 10https://gerrit.wikimedia.org/r/290643 (https://phabricator.wikimedia.org/T101343) [08:39:07] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6503879 keys - replication_delay is 0 [08:39:24] !log perform second schema change on x1 wikis for T135699 [08:39:25] T135699: Schema changes for Echo moderation - https://phabricator.wikimedia.org/T135699 [08:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:41:37] RECOVERY - puppet last run on mw2089 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [08:42:05] 06Operations, 10RESTBase-Cassandra, 06Services, 10cassandra: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771#2325653 (10fgiunchedi) I've removed old metrics for restbase200[356] and we're at ~520G used on the new machines [08:42:17] RECOVERY - puppet last run on mw2151 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:44:27] RECOVERY - puppet last run on mw1024 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [08:46:25] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/2900/ looks good, the new settings show up correctly in the Spark defaults." 
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/290643 (https://phabricator.wikimedia.org/T101343) (owner: 10Elukey) [08:57:14] (03PS1) 10Mobrovac: Math: Enable MathML on all wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290644 (https://phabricator.wikimedia.org/T131177) [09:00:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] uwsgi: include app name in syslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290454 (owner: 10Filippo Giunchedi) [09:02:25] !log upload php5_5.3.10-1ubuntu3.23+wmf1 on apt.wikimedia.org/precise-wikimedia [09:02:30] moritzm: ^ [09:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:02:52] !log Apply grant for tendril on labservices1002 T106303 [09:02:54] T106303: rename holmium to labservices1002 - https://phabricator.wikimedia.org/T106303 [09:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:08:08] ack, thanks [09:08:40] Hello. [09:09:55] Jamesofur: fyi I've rescheduled l10n change: l10nupdate backported successfully to wmf2 and wmf3, but didn't scap that https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=567336&oldid=567334 If foks isn't available for SWAT, I guess you or me can test if the message is well propagated after. [09:11:16] Dereckson: do you know why scap is needed? [09:11:16] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:11:29] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:11:29] thcipriani|afk: For morning SWAT, if you see unexpected l10n commits during a backport for an extension, that's the yesterday l10nupdate ^ [09:11:51] Nikerabbit: no, because it rebuilds the l10n cache [09:11:54] rebuilt [09:12:17] PROBLEM - puppet last run on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:12:19] It is supposed to also deploy the cache [09:12:46] PROBLEM - SSH on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:12:47] PROBLEM - HHVM processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:12:56] PROBLEM - salt-minion processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:13:07] PROBLEM - configured eth on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:13:17] Nikerabbit: in /var/lib/l10nupdate/mediawiki/extensions/WikimediaMessages all seems ok [09:13:32] (03CR) 10Filippo Giunchedi: uwsgi: include app name in syslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290454 (owner: 10Filippo Giunchedi) [09:13:38] PROBLEM - Check size of conntrack table on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:13:38] PROBLEM - nutcracker port on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:13:47] PROBLEM - nutcracker process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:13:47] PROBLEM - DPKG on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:13:53] Nikerabbit: but I've checked on mw1017 l10n files weren't synced [09:14:10] did you check l10nupdate log for errors? [09:14:18] PROBLEM - RAID on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:15:04] dunno if https://phabricator.wikimedia.org/T125992 is still unfixed [09:15:05] (03PS3) 10Elukey: Add support for Spark Dynamic Resource allocation. [puppet/cdh] - 10https://gerrit.wikimedia.org/r/290643 (https://phabricator.wikimedia.org/T101343) [09:16:07] PROBLEM - Disk space on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:16:20] Nikerabbit: if a failure occurs, it reports it: https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/files/l10nupdate-1 [09:16:29] anybody already looking into mw1114 ? 
[09:16:32] yes me [09:16:40] super :) [09:16:41] <_joe_> it's an OOM again [09:16:42] it's going for a reboot it seems [09:16:49] I can't login as root [09:17:01] and I see nothing of interest in the console [09:17:29] only the login prompt (which apart from not prompting for a password ever, seems to work) [09:17:46] Nikerabbit: actually, it didn't reach $BINDIR/dologmsg "!log $LOGNAME@$HOSTNAME ResourceLoader cache refresh completed at $(date -ud @$ENDED) (duration $DURATION)" [09:17:53] according to the SAL [09:18:43] we had an issue yesterday with mw1114 [09:18:46] perhaps it failed at /usr/local/bin/foreachwiki extensions/WikimediaMaintenance/refreshMessageBlobs.php [09:19:19] !log powercycling mw1114 [09:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:19:28] PROBLEM - dhclient process on mw1114 is CRITICAL: Timeout while attempting connection [09:19:44] akosiaris: mutante rebooted it yesterday evening too by the way, it's the third failure [09:21:37] RECOVERY - nutcracker port on mw1114 is OK: TCP OK - 0.000 second response time on port 11212 [09:21:47] RECOVERY - nutcracker process on mw1114 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [09:21:47] RECOVERY - DPKG on mw1114 is OK: All packages OK [09:21:57] RECOVERY - Disk space on mw1114 is OK: DISK OK [09:22:08] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 43 minutes ago with 0 failures [09:22:16] RECOVERY - RAID on mw1114 is OK: OK: no RAID installed [09:22:38] RECOVERY - SSH on mw1114 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [09:22:46] RECOVERY - HHVM processes on mw1114 is OK: PROCS OK: 25 processes with command name hhvm [09:22:56] RECOVERY - salt-minion processes on mw1114 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:23:06] RECOVERY - configured eth on mw1114 is OK: OK - interfaces up [09:23:16] RECOVERY - Apache HTTP on
mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.419 second response time [09:23:18] RECOVERY - dhclient process on mw1114 is OK: PROCS OK: 0 processes with command name dhclient [09:23:27] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 66994 bytes in 0.364 second response time [09:23:27] RECOVERY - Check size of conntrack table on mw1114 is OK: OK: nf_conntrack is 3 % full [09:24:39] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: rename holmium to labservices1002 - https://phabricator.wikimedia.org/T106303#1464173 (10Volans) I've applied the tendril grant from `/etc/mysql/production-grants.sql` (and only that one) required to have tendril monitor this host and added t... [09:25:01] Dereckson: thanks, he's more likely to be available than I am for the morning SWAT but I should be semi around (at least able to test from phone) 8am is early for me but after he starts his normal hours [09:26:18] !log mw1114's logs and ganglia indicate OOM error. [09:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:26:52] _joe_: you were obviously right. It's an OOM. However, the OOM killer seems to not have shown up; instead the box froze [09:27:41] it's definitely leaking memory https://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&h=mw1114.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=mem_report&c=API+application+servers+eqiad [09:35:02] (03CR) 10Alexandros Kosiaris: [C: 031] uwsgi: include app name in syslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290454 (owner: 10Filippo Giunchedi) [09:40:20] 06Operations, 10DBA: Physical location SPOF because of database server distribution on a single rack (D1) - https://phabricator.wikimedia.org/T111992#2325726 (10Volans) Updated with all the new to-be-configured ones, re-checked everything with racktables.
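akosiaris noted the OOM killer never seemed to show up before mw1114 froze; confirming that usually means scanning the kernel log for the killer's signature lines. A minimal sketch of that scan; the sample lines are invented, not mw1114's actual dmesg:

```python
import re

# The OOM killer leaves distinctive kernel-log markers. Their absence on a
# memory-starved host suggests the box locked up before the killer ran, as
# suspected for mw1114 above.
OOM_RE = re.compile(r"invoked oom-killer|Out of memory: Kill process")

def oom_events(kernel_log_lines):
    """Return the kernel-log lines that record OOM-killer activity."""
    return [line for line in kernel_log_lines if OOM_RE.search(line)]

sample = [
    "[1234.5] hhvm invoked oom-killer: gfp_mask=0x201da, order=0",
    "[1234.6] Out of memory: Kill process 4242 (hhvm) score 912",
    "[1300.0] eth0: link becomes ready",
]
print(oom_events(sample))
```

An empty result over the window when the host wedged would support the "froze without the killer firing" reading.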
[09:42:14] !log restarted gmetad on uranium to test if new memcached metrics would be picked up (T129963) [09:42:15] T129963: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963 [09:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:58:58] (03PS2) 10Jcrespo: Apply firewall to db1016 [puppet] - 10https://gerrit.wikimedia.org/r/290257 (https://phabricator.wikimedia.org/T135973) [09:59:06] Jamesofur: ok [09:59:29] * Jamesofur also sent him an email so that he knows about it [10:10:31] (03CR) 10Jcrespo: [C: 032] Apply firewall to db1016 [puppet] - 10https://gerrit.wikimedia.org/r/290257 (https://phabricator.wikimedia.org/T135973) (owner: 10Jcrespo) [10:14:19] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2325809 (10fgiunchedi) re: python-statsd I've uploaded the 3.1.2-2 to unstable and will backport once it has migrated to testing (cc @hashar) [10:16:09] Nikerabbit: l10nupdate yesterday log: l10n merge: 99% (ok: 391; fail: 0; left: 1) Timeout, server mw1115.eqiad.wmnet not responding. 
[10:16:12] scap-cdb-rebuild: 99% (ok: 350; fail: 1; left: 1) [10:16:15] and it ends there [10:17:10] (03PS1) 10Jcrespo: Revert "Apply firewall to db1016" [puppet] - 10https://gerrit.wikimedia.org/r/290648 [10:18:46] (03PS2) 10Jcrespo: Revert "Apply firewall to db1016"- apply it on a different entry [puppet] - 10https://gerrit.wikimedia.org/r/290648 [10:19:43] 06Operations, 06Project-Admins, 10Traffic, 07HTTPS: HTTPS phabricator project(s) - https://phabricator.wikimedia.org/T86063#2325850 (10Danny_B) [10:19:55] 06Operations, 10Ops-Access-Requests, 06Project-Admins: Create "ops-access-requests" project - https://phabricator.wikimedia.org/T84867#2325853 (10Danny_B) [10:20:18] (03CR) 10Jcrespo: [C: 032] Revert "Apply firewall to db1016"- apply it on a different entry [puppet] - 10https://gerrit.wikimedia.org/r/290648 (owner: 10Jcrespo) [10:32:59] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 71 failures [10:34:52] gilles: godog: paravoid: sorry I have yet to digest/reply to the Thumbor packaging task https://phabricator.wikimedia.org/T134485 :( [10:39:37] 06Operations, 07HHVM, 13Patch-For-Review, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2325895 (10Joe) >>! In T86096#2325428, @tstarling wrote: >>>! In T86096#965842, @Reedy wrote: >> Yup, updateCollation.php does need running when we've upgrade... 
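The stall quoted above (l10n sync at 99%, then "Timeout, server mw1115.eqiad.wmnet not responding" and "it ends there") is the classic case for per-host deadlines in batch tooling. A generic sketch of that pattern, not scap's actual implementation; hosts and timings are simulated:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def sync_host(host):
    # Simulate mw1115 never answering, as in the l10nupdate log above.
    if host == "mw1115":
        time.sleep(2)
    return f"{host}: ok"

def sync_all(hosts, per_host_timeout=0.5):
    """Run one sync per host, bounding each wait so a single dead host
    cannot stall the whole batch."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        futures = {pool.submit(sync_host, h): h for h in hosts}
        for future, host in futures.items():
            try:
                results[host] = future.result(timeout=per_host_timeout)
            except FutureTimeout:
                results[host] = "timed out"
    return results

print(sync_all(["mw1110", "mw1115", "mw1117"]))
# → {'mw1110': 'mw1110: ok', 'mw1115': 'timed out', 'mw1117': 'mw1117: ok'}
```

With a bounded wait per host, the batch reports "timed out" for the dead host and still finishes, instead of hanging the whole run.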
[10:43:44] (03CR) 10Filippo Giunchedi: [C: 04-1] "a missing guard for scap3 on the config directory, generally LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290490 (owner: 10Mobrovac) [10:44:37] hashar: no worries, I CC'd you for python-statsd [10:45:04] I never managed to catch up with the mails for the package I got uploaded to debian [10:45:23] seems they are only sent to the python module team list which is quite spammy [10:48:59] they should be sent also to packages.qa.debian.org and tracker.debian.org I think, you can subscribe there to packages [10:49:26] ohh [10:49:55] that latter site asks me about a client certificate for some reason ... [10:53:13] subscribed! thx godog [11:02:02] !log restbase deploy start of 8f39fa4 [11:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:12:30] !log restbase deploy end of 8f39fa4 [11:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:15:38] tendril down [11:15:39] it's back [11:22:48] PROBLEM - puppet last run on mw2182 is CRITICAL: CRITICAL: Puppet has 1 failures [11:27:00] (03PS2) 10Mobrovac: service::node: Prepare for scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/290490 [11:27:50] 06Operations, 07HHVM, 13Patch-For-Review, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2326032 (10Joe) As usual, one shouldn't base his evaluations on "number of wikis". I extracted the total number of rows in the categorylinks tables for the va...
[11:27:57] (03CR) 10jenkins-bot: [V: 04-1] service::node: Prepare for scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/290490 (owner: 10Mobrovac) [11:28:37] PROBLEM - HHVM rendering on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:28:53] (03PS3) 10Mobrovac: service::node: Prepare for scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/290490 [11:28:57] PROBLEM - Apache HTTP on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:29:45] (03CR) 10jenkins-bot: [V: 04-1] service::node: Prepare for scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/290490 (owner: 10Mobrovac) [11:29:49] really? [11:29:51] uf [11:30:28] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 67015 bytes in 0.236 second response time [11:30:49] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.063 second response time [11:31:18] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [11:33:47] (03PS4) 10Mobrovac: service::node: Prepare for scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/290490 [11:45:31] (03CR) 10Mobrovac: service::node: Prepare for scap3 config deploys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290490 (owner: 10Mobrovac) [11:48:17] RECOVERY - puppet last run on mw2182 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [12:04:10] jan_drewniak: Jdrewniak [12:04:31] Dereckson: hi there [12:04:46] Hello. Could you edit https://phabricator.wikimedia.org/T136019 so it makes clear MarcoAurelio wanted a stat update and not only to deploy recently merged changes? [12:05:02] I think I misinterpreted the request. [12:05:38] and by the way, we have 3 undeployed portal changes, it would be a good idea to ask for a submodule update at the next SWAT.
[12:06:53] Undeployed merged code isn't a best practice: we should have a match between what's in the repo and what is deployed [12:08:13] (03PS1) 10Jcrespo: Reintroduce es200[1234] in puppet, without specific roles [puppet] - 10https://gerrit.wikimedia.org/r/290665 (https://phabricator.wikimedia.org/T134755) [12:10:44] (03CR) 10Muehlenhoff: Reintroduce es200[1234] in puppet, without specific roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290665 (https://phabricator.wikimedia.org/T134755) (owner: 10Jcrespo) [12:10:55] (03PS2) 10Jcrespo: Reintroduce es200[1234] in puppet, without specific roles [puppet] - 10https://gerrit.wikimedia.org/r/290665 (https://phabricator.wikimedia.org/T134755) [12:12:19] Dereckson: I will update the submodule and schedule the SWAT. Regarding the 'stats update' MarcoAurelio filed this ticket https://phabricator.wikimedia.org/T136020 to request an update to this page https://meta.wikimedia.org/wiki/Module:Project_portal/views I think that's what he meant by a 'stats update'. [12:12:42] k [12:12:45] (03CR) 10Alex Monk: "Doesn't seem to work properly, compare "telnet rcs1001.eqiad.wmnet 6379" on tin to silver. silver gets stuck for a minute trying the IPv6." [puppet] - 10https://gerrit.wikimedia.org/r/290504 (owner: 10Dzahn) [12:12:47] (03CR) 10Jcrespo: "But they are not spare hosts, they are in use, just they are data holders with no specific needs. What classes do these need?" [puppet] - 10https://gerrit.wikimedia.org/r/290665 (https://phabricator.wikimedia.org/T134755) (owner: 10Jcrespo) [12:16:44] Dereckson: The recently merged changes -- what will be deployed today -- basically amount to just updating the stats on the portal pages, so I think that addresses this ticket: https://phabricator.wikimedia.org/T136019 The other changes in these commits are minor (I reverted a large change), but thanks for pointing out the best practice regarding undeployed
[12:18:20] (03CR) 10Alex Monk: "What about the PTR?" [dns] - 10https://gerrit.wikimedia.org/r/290542 (owner: 10Dzahn) [12:20:15] (03PS3) 10Jcrespo: Reintroduce es200[1234] in puppet, without specific roles [puppet] - 10https://gerrit.wikimedia.org/r/290665 (https://phabricator.wikimedia.org/T134755) [12:22:39] (03CR) 10Jcrespo: ""debdeploy server groups" where is that on puppet? These server are already on the mysql cluster by the hiera regex." [puppet] - 10https://gerrit.wikimedia.org/r/290665 (https://phabricator.wikimedia.org/T134755) (owner: 10Jcrespo) [12:25:49] (03PS4) 10Jcrespo: Reintroduce es200[1234] in puppet, without specific roles [puppet] - 10https://gerrit.wikimedia.org/r/290665 (https://phabricator.wikimedia.org/T134755) [12:32:08] (03CR) 10Muehlenhoff: "Please disregard my comment, the Salt grains for es* are actually assigned via hieradata/regex.yaml instead of hieradata/role, so this wil" [puppet] - 10https://gerrit.wikimedia.org/r/290665 (https://phabricator.wikimedia.org/T134755) (owner: 10Jcrespo) [12:43:48] (03CR) 10Volans: [C: 031] "Looks sane to me to be in a minimal frozen state." 
[puppet] - 10https://gerrit.wikimedia.org/r/290665 (https://phabricator.wikimedia.org/T134755) (owner: 10Jcrespo) [12:50:19] (03CR) 10Jcrespo: [C: 032] "Looks good: https://puppet-compiler.wmflabs.org/2905/" [puppet] - 10https://gerrit.wikimedia.org/r/290665 (https://phabricator.wikimedia.org/T134755) (owner: 10Jcrespo) [12:56:11] (03PS2) 10Filippo Giunchedi: uwsgi: include app name in syslog [puppet] - 10https://gerrit.wikimedia.org/r/290454 [12:56:17] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] uwsgi: include app name in syslog [puppet] - 10https://gerrit.wikimedia.org/r/290454 (owner: 10Filippo Giunchedi) [13:01:18] (03PS2) 10Filippo Giunchedi: graphite: split uwsgi logs to separate files [puppet] - 10https://gerrit.wikimedia.org/r/290455 [13:09:25] (03PS1) 10DCausse: Upgrade cluster plugins for elasticsearch 2.3.3 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/290679 [13:10:18] 06Operations, 07HHVM, 13Patch-For-Review, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2326207 (10matmarex) Note that a few smaller wikis were just added to that list per T75453. [13:10:56] (03CR) 10DCausse: [C: 04-1] "should not be merged now." 
[software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/290679 (owner: 10DCausse) [13:11:52] (03PS2) 10DCausse: Upgrade cluster plugins for elasticsearch 2.3.3 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/290679 (https://phabricator.wikimedia.org/T134937) [13:13:11] 06Operations, 07HHVM, 13Patch-For-Review, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2326220 (10matmarex) Actually, nevermind that… [13:13:17] PROBLEM - Host es2001 is DOWN: PING CRITICAL - Packet loss = 100% [13:13:23] 06Operations, 07HHVM, 13Patch-For-Review, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2326221 (10Joe) @matmarex I have based my evaluation on the latest version of InitializeSettings.php, so I think my figures are pretty accurate. [13:13:28] RECOVERY - Host es2001 is UP: PING OK - Packet loss = 0%, RTA = 37.58 ms [13:13:57] jynus: firewall got applied? ^^^ :-) [13:14:09] reboot after upgrade [13:14:13] _joe_: nevermind. someone quite goofed up with tawiki… wanna revert and deploy a config patch for me? [13:14:20] https://ta.wikipedia.org/wiki/பகுப்பு:பயனர்_ta [13:14:23] all category pages are broken. [13:14:30] https://gerrit.wikimedia.org/r/290529 [13:14:42] <_joe_> MatmaRex uh that's bad [13:14:58] the error is "MediaWiki does not support ICU locale "ta"" [13:15:01] which it doesn't [13:15:13] hu hu [13:15:17] (03PS1) 10Urbanecm: Changetags should be granted only to sysops and bots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290680 (https://phabricator.wikimedia.org/T136187) [13:15:31] _joe_: can I fix that reverting a change I merged yesterday during SWAT? [13:15:32] <_joe_> so let's revert? 
[13:15:32] this requires a small (hopefully) core patch first to allow it [13:15:51] <_joe_> Dereckson: yes, I can deploy it ofc [13:15:52] (03PS1) 10Rush: tools.checker continually watch for webservices [puppet] - 10https://gerrit.wikimedia.org/r/290681 (https://phabricator.wikimedia.org/T136162) [13:17:43] (03PS2) 10Urbanecm: Changetags should be granted only to sysops and bots in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290680 (https://phabricator.wikimedia.org/T136187) [13:17:46] (03PS1) 10Dereckson: Revert "Set Tamil projects to use uca-ta collation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290682 [13:17:48] _joe_: here you are ^ [13:19:03] <_joe_> Dereckson: ack! [13:20:48] (03CR) 10Giuseppe Lavagetto: [C: 032] Revert "Set Tamil projects to use uca-ta collation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290682 (owner: 10Dereckson) [13:21:07] So MatmaRex, I suggest that at a next SWAT we backport the change to support ta on MediaWiki and then revert the revert [13:21:51] Is there such a change? [13:21:58] not yet [13:22:25] sure. i don't really have time to help at the moment, sorry :( [13:23:19] bawolff: we need a first-letters-uca-ta.ser precompiled data [13:23:31] no, you probably don't [13:24:08] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 606 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6526543 keys - replication_delay is 606 [13:24:43] You probably need to just add stuff to $tailoringFirstLetters [13:24:57] Unless there is something very weird with how ta works [13:25:03] relative to the other collations [13:25:07] <_joe_> uhm sorry guys, I am having some issues with the new scap [13:25:46] _joe_: scap sync-file wmf-config/InitialiseSettings.php "message" [13:25:58] <_joe_> Dereckson: that's what I just did... [13:26:07] (03PS1) 10Jdrewniak: T136019 updating portal stats. Bumping portals to master.
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/290683 (https://phabricator.wikimedia.org/T136019) [13:26:08] <_joe_> seems to be a permissions problem [13:26:13] (03CR) 10Ottomata: "3 inline comments." (033 comments) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/290643 (https://phabricator.wikimedia.org/T101343) (owner: 10Elukey) [13:28:22] I am going to try that^ [13:28:54] _joe_: seems on tin there are already some scap processes running, for a sync l10n [13:29:04] yeah, same thing [13:29:16] ps auxw | grep mwdeploy [13:29:38] /usr/bin/python /usr/bin/scap sync-l10n [13:29:55] root is doing that [13:30:39] <_joe_> Dereckson: yes killing it [13:30:49] probably stuck with some non-responsive mwxxxx [13:31:36] <_joe_> Dereckson: mw1117, to be precise [13:31:53] !log oblivian@tin Synchronized wmf-config/InitialiseSettings.php: revert uca-ta collation (duration: 00m 31s) [13:31:56] <_joe_> Dereckson: the fact that one unresponsive server leaves scap hanging is obviously a grave bug [13:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:33:40] thanks _joe_ [13:34:04] <_joe_> yw [13:34:23] <_joe_> MatmaRex: categories are back [13:36:29] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6502028 keys - replication_delay is 0 [13:40:24] (03CR) 10Filippo Giunchedi: "I've now committed the private keys to private.git and generated the public keys (five keys in total)" [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [13:42:28] <_joe_> bawolff: the wikis that will be affected will be ~ 90 [13:42:34] <_joe_> not 20 :) [13:42:47] (03PS3) 10Filippo Giunchedi: graphite: split uwsgi logs to separate files [puppet] - 10https://gerrit.wikimedia.org/r/290455 [13:43:02] _joe_: This just proves I should never take guesses at numbers without checking them ;) [13:43:46] <_joe_> bawolff: if you want numbers, my
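[Editor's note: the scap hang above — one unresponsive appserver, mw1117, blocking the whole sync — is the kind of bug a per-host timeout avoids. A minimal sketch of that idea; this is not scap's actual implementation, and the host names and simulated hang are purely illustrative.]

```python
# Hypothetical sketch, not scap's real code: run the per-host sync in daemon
# threads and join() each with a timeout, so one hung host is reported and
# skipped instead of blocking the entire deploy.
import threading

def sync_all(hosts, timeout=0.5, hung=()):
    """Run a (fake) sync on every host; flag hosts that exceed `timeout`."""
    results = {}

    def sync_one(host):
        if host in hung:
            threading.Event().wait(30)  # stand-in for an unresponsive server
        results[host] = "ok"

    threads = {h: threading.Thread(target=sync_one, args=(h,), daemon=True)
               for h in hosts}
    for t in threads.values():
        t.start()
    for host, t in threads.items():
        t.join(timeout)
        if t.is_alive():
            results[host] = "timeout"  # report and move on, don't hang

    return results

print(sync_all(["mw1116", "mw1117"], hung={"mw1117"}))
# -> {'mw1116': 'ok', 'mw1117': 'timeout'}
```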
last comments on https://phabricator.wikimedia.org/T86096 have plenty of them [13:44:15] Yeah, I actually read that after sending my email, and was oh whoops... [13:44:22] <_joe_> hehe [13:44:57] <_joe_> admittedly, most of the wikis on s3 that are affected are either very small or have a small list of categorylinks [13:45:25] <_joe_> frwiki is the major potential offender [13:45:35] <_joe_> and btw, we're not sure the collation will be fucked up [13:46:02] fr.wikipedia is on s6 now [13:46:07] <_joe_> I looked at the changelogs and while in the past there were breaking changes, it doesn't seem like this was the case [13:46:11] <_joe_> Dereckson: yes [13:46:26] <_joe_> Dereckson: and accounts for 33.4 M rows by itself :) [13:50:02] if you want to test to see if it would be, you could do something like echo bin2hex(Collation::factory( 'uca-fr' )->getSortKey( 'Foo' )); in eval.php from both the new version and the old version and see if the results are the same [13:50:11] (03PS1) 10Dereckson: Revert "Revert "Set Tamil projects to use uca-ta collation"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290686 (https://phabricator.wikimedia.org/T75453) [13:50:50] If you get the same results, definitely won't cause any problems. 
If you get different results, they will probably mess up the categories temporarily [13:51:02] (03CR) 10Mobrovac: "PCC - https://puppet-compiler.wmflabs.org/2903/" [puppet] - 10https://gerrit.wikimedia.org/r/290490 (owner: 10Mobrovac) [13:52:28] (03PS3) 10Volans: Cleanup my.cnf by grouping options [puppet] - 10https://gerrit.wikimedia.org/r/286858 (owner: 10Jcrespo) [13:54:24] (03PS4) 10Volans: Cleanup my.cnf by grouping options [puppet] - 10https://gerrit.wikimedia.org/r/286858 (owner: 10Jcrespo) [13:55:49] (03PS4) 10Filippo Giunchedi: graphite: split uwsgi logs to separate files [puppet] - 10https://gerrit.wikimedia.org/r/290455 [13:55:51] (03PS6) 10Filippo Giunchedi: prometheus: add server support [puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) [13:55:53] (03PS2) 10Filippo Giunchedi: prometheus: add nginx reverse proxy [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) [14:00:43] (03PS4) 10Elukey: Add support for Spark Dynamic Resource allocation. [puppet/cdh] - 10https://gerrit.wikimedia.org/r/290643 (https://phabricator.wikimedia.org/T101343) [14:05:12] 06Operations, 10ops-eqiad, 06DC-Ops: I/O issues for /dev/sdd on analytics1047.eqiad.wmnet - https://phabricator.wikimedia.org/T134056#2326297 (10Cmjohnson) @elukey @ottomata I have the disk on-site let me know when you're available to coordinate the replacement. I am 99% cert it's slot 5. [14:06:06] (03PS5) 10Elukey: Add support for Spark Dynamic Resource allocation. [puppet/cdh] - 10https://gerrit.wikimedia.org/r/290643 (https://phabricator.wikimedia.org/T101343) [14:06:26] cmjohnson1: now is good [14:06:46] _joe_: I guess this all raises a bigger question though - is our collation stuff scalable enough to use it in general [14:07:07] Like imagine if every single wiki used it (Esp. 
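[Editor's note: bawolff's eval.php check above generalizes to a sample of titles — capture each title's sort key (e.g. via bin2hex) under both the old and new ICU builds, then diff the two sets. A rough sketch of the comparison step; the hex keys below are invented, not real uca-fr output.]

```python
# Sketch only: given sort keys captured under the old and new ICU builds,
# list the titles whose keys changed -- those category entries would sort
# differently and need re-sorting after the upgrade.
def changed_sort_keys(old, new):
    """Titles whose sort key differs between the two builds."""
    return sorted(t for t in old if new.get(t) != old[t])

old = {"Foo": "29454f4f01", "Bar": "29404c4101"}  # invented sample keys
new = {"Foo": "29454f4f01", "Bar": "29404c4201"}
print(changed_sort_keys(old, new))
# -> ['Bar']  (identical keys on both builds would print [])
```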
enwiki) [14:07:18] This update would be a bit of a nightmare in that case [14:07:57] <_joe_> bawolff: well it would mean we'd have a weekend of potential inconsistencies [14:08:05] (03PS4) 10Muehlenhoff: Install firejail on image/video scalers [puppet] - 10https://gerrit.wikimedia.org/r/288379 (https://phabricator.wikimedia.org/T135111) [14:08:07] <_joe_> but yes the process must be repeatable [14:08:23] (03CR) 10Muehlenhoff: [C: 032 V: 032] Install firejail on image/video scalers [puppet] - 10https://gerrit.wikimedia.org/r/288379 (https://phabricator.wikimedia.org/T135111) (owner: 10Muehlenhoff) [14:10:25] 06Operations, 10ops-eqiad: db1034 degraded RAID - https://phabricator.wikimedia.org/T135353#2326329 (10Cmjohnson) disk swapped and is rebuilding Enclosure Device ID: 32 Slot Number: 5 Drive's position: DiskGroup: 0, Span: 2, Arm: 1 Enclosure position: N/A Device Id: 5 WWN: 5000C5003240B064 Sequence Number: 11... [14:10:32] (03CR) 10Ottomata: [C: 031] "Just some typos! Feel free to merge after fixing those." (032 comments) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/290643 (https://phabricator.wikimedia.org/T101343) (owner: 10Elukey) [14:10:57] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up 16 db's db1079-1094 - https://phabricator.wikimedia.org/T135253#2326335 (10Cmjohnson) [14:11:13] cmjohnson1: lemme know if you are avail and I will write some crazy bytes :) [14:11:14] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up 16 db's db1079-1094 - https://phabricator.wikimedia.org/T135253#2293311 (10Cmjohnson) [14:13:46] (03PS6) 10Elukey: Add support for Spark Dynamic Resource allocation.
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/290643 (https://phabricator.wikimedia.org/T101343) [14:13:56] (03CR) 10jenkins-bot: [V: 04-1] prometheus: add server support [puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [14:14:02] Dereckson: I'm going to upload a new patchset over top of yours for the tamil thing, if that's ok [14:14:17] PROBLEM - puppet last run on mw2178 is CRITICAL: CRITICAL: Puppet has 1 failures [14:14:38] 06Operations, 10ops-eqiad: db1023 Degraded RAID - https://phabricator.wikimedia.org/T135157#2326361 (10Cmjohnson) The disk at slot 8 has been swapped and is rebuilding. Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware sta... [14:14:41] (03CR) 10Elukey: "Puppet compiler looks good: https://puppet-compiler.wmflabs.org/2907/analytics1032.eqiad.wmnet/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/290643 (https://phabricator.wikimedia.org/T101343) (owner: 10Elukey) [14:14:51] (03CR) 10Elukey: [C: 032 V: 032] "Puppet compiler looks good: https://puppet-compiler.wmflabs.org/2907/analytics1032.eqiad.wmnet/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/290643 (https://phabricator.wikimedia.org/T101343) (owner: 10Elukey) [14:15:45] 06Operations, 10ops-eqiad, 10DBA: db1056 BBU Failed - https://phabricator.wikimedia.org/T136136#2326379 (10Cmjohnson) @volans @jcrespo: LMK when you can take db1056 down and I will swap out the BBU [14:17:17] (03PS1) 10Elukey: Update the cdh module to the last version - Adding Spark Dynamic allocation support. [puppet] - 10https://gerrit.wikimedia.org/r/290692 (https://phabricator.wikimedia.org/T101343) [14:17:42] (03CR) 10Elukey: [C: 032 V: 032] Update the cdh module to the last version - Adding Spark Dynamic allocation support. 
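[Editor's note: the truncated phab paste above tracks a RAID rebuild by eyeballing repeated "Firmware state:" lines. A small hypothetical helper — not a Wikimedia tool — to tally those lines from `megacli -PDList` output instead of counting them by hand; the sample text is abbreviated from the log.]

```python
# Hypothetical helper: summarize the "Firmware state:" lines in megacli
# -PDList output so a rebuilding or failed drive stands out immediately.
from collections import Counter

def firmware_states(pdlist_output):
    return Counter(
        line.split(":", 1)[1].strip()
        for line in pdlist_output.splitlines()
        if line.startswith("Firmware state:")
    )

sample = """Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Rebuild"""
print(firmware_states(sample))
# -> Counter({'Online, Spun Up': 2, 'Rebuild': 1})
```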
[puppet] - 10https://gerrit.wikimedia.org/r/290692 (https://phabricator.wikimedia.org/T101343) (owner: 10Elukey) [14:18:52] (03PS1) 10Eevans: enable instance restbase1012-c [puppet] - 10https://gerrit.wikimedia.org/r/290694 (https://phabricator.wikimedia.org/T134016) [14:20:24] AaronSchulz: yt? [14:20:41] (03PS1) 10Jcrespo: Depool db1056 for hardware maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290695 (https://phabricator.wikimedia.org/T136136) [14:21:02] 06Operations, 10ops-eqiad, 06DC-Ops: I/O issues for /dev/sdd on analytics1047.eqiad.wmnet - https://phabricator.wikimedia.org/T134056#2326395 (10Ottomata) @Cmjohnson can do this now, ping me in #ops. [14:21:47] RECOVERY - cassandra-c CQL 10.64.0.116:9042 on restbase1010 is OK: TCP OK - 0.002 second response time on port 9042 [14:21:54] @ottomata: cool, can you verify it's slot 3? [14:22:13] (03CR) 10Eevans: "1010-c just finished, so this can be merged at any time." [puppet] - 10https://gerrit.wikimedia.org/r/290694 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [14:22:17] cmjohnson1: you ready? i will dd some data [14:22:19] i put slot 5 in task but was wrong..../dev/sdd should be slot 3 [14:22:20] you looking at blinky lights? 
[14:22:29] yep [14:22:31] (03CR) 10Jcrespo: [C: 032] Depool db1056 for hardware maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290695 (https://phabricator.wikimedia.org/T136136) (owner: 10Jcrespo) [14:22:41] k doing [14:23:02] (03PS3) 10DCausse: Upgrade cluster plugins for elasticsearch 2.3.3 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/290679 (https://phabricator.wikimedia.org/T134937) [14:23:39] ottomata: nothing is happening [14:23:44] hmm [14:23:57] !log Starting cleanup on restbase1010-a.eqiad.wmnet : T134016 [14:23:58] T134016: RESTBase Cassandra cluster: Increase instance count from 2 to 3 - https://phabricator.wikimedia.org/T134016 [14:24:03] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1056 for hardware maintenance (duration: 00m 24s) [14:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:24:13] one sec [14:24:48] cmjohnson1: how about now? [14:24:54] (03PS3) 10Giuseppe Lavagetto: mediawiki: assign new eqiad appservers, install with jessie [puppet] - 10https://gerrit.wikimedia.org/r/290236 [14:25:07] sort of [14:25:20] but can't say for sure [14:25:31] hmm ok lemme try something else [14:25:57] !log shutting down db1056 for hardware maintenance [14:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:27:28] ottomata: can you write to /dev/sdc [14:28:37] cmjohnson1: writing to sdd now, anything? [14:28:59] yes...slot 2 [14:29:11] ok want me to write to c?
[14:29:13] (03PS1) 10Muehlenhoff: Provide a firejail profile for the image scalers [puppet] - 10https://gerrit.wikimedia.org/r/290696 (https://phabricator.wikimedia.org/T135111) [14:29:17] yes please [14:29:20] ok stopping d [14:29:32] writing to c now [14:29:44] okay...and sde plz [14:30:02] (03CR) 10Muehlenhoff: mediawiki: assign new eqiad appservers, install with jessie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290236 (owner: 10Giuseppe Lavagetto) [14:30:13] PROBLEM - MariaDB Slave IO: s4 on db1056 is CRITICAL: CRITICAL slave_io_state could not connect [14:30:14] ok, cmjohnson1 e going now [14:30:22] ottomata: okay slot 2 is /dev/sdd [14:30:22] PROBLEM - MariaDB Slave SQL: s4 on db1056 is CRITICAL: CRITICAL slave_sql_state could not connect [14:30:28] ok [14:30:29] will swap now [14:30:39] but I downtimed on icinga! [14:30:50] k cmjohnson1 umounted it go ahead [14:30:51] icinga says no [14:30:53] musta missed it barely :) [14:30:55] ah [14:30:56] (03PS2) 10BBlack: cache_text: raise FE mem from 1/8 to 1/4 total [puppet] - 10https://gerrit.wikimedia.org/r/288936 (https://phabricator.wikimedia.org/T135384) [14:30:58] host vs service? [14:31:11] it seems like you actually have to hit commit for change to take place [14:31:32] jynus: yeah! :) [14:31:57] (03CR) 10BBlack: [C: 032 V: 032] cache_text: raise FE mem from 1/8 to 1/4 total [puppet] - 10https://gerrit.wikimedia.org/r/288936 (https://phabricator.wikimedia.org/T135384) (owner: 10BBlack) [14:32:42] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet has 1 failures [14:33:54] ottomata: the disk is swapped if you want to add back [14:34:24] thanks cmjohnson1! [14:35:28] 06Operations, 10ops-eqiad, 06DC-Ops: I/O issues for /dev/sdd on analytics1047.eqiad.wmnet - https://phabricator.wikimedia.org/T134056#2326437 (10Cmjohnson) Swapped the disk in slot 2 which was determined by myself and ottomata to be the location of /dev/sdd.
[14:38:13] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 685 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6507266 keys - replication_delay is 685 [14:38:53] RECOVERY - puppet last run on mw2178 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [14:39:04] (03PS2) 10Muehlenhoff: Provide a firejail profile for the image scalers [puppet] - 10https://gerrit.wikimedia.org/r/290696 (https://phabricator.wikimedia.org/T135111) [14:40:09] 06Operations, 10ops-eqiad: ship single dell 500GB sata to ulsfo - https://phabricator.wikimedia.org/T133699#2326469 (10Cmjohnson) @robh did you receive this? [14:44:09] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: db1056 BBU Failed - https://phabricator.wikimedia.org/T136136#2326485 (10jcrespo) a:03jcrespo We will wait for it to go to optimal state before repooling it ``` megacli -AdpBbuCmd -aALL BBU status for Adapter: 0... [14:47:16] (03PS2) 10Filippo Giunchedi: enable instance restbase1012-c [puppet] - 10https://gerrit.wikimedia.org/r/290694 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [14:47:22] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] enable instance restbase1012-c [puppet] - 10https://gerrit.wikimedia.org/r/290694 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [14:49:54] PROBLEM - puppet last run on mw2068 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [14:50:20] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: db1056 BBU Failed - https://phabricator.wikimedia.org/T136136#2326506 (10jcrespo) Starting mysql now: ``` Run time to empty: Battery is not being charged. 
Write Policy : WB ``` [14:50:33] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.909 second response time [14:50:44] !log restarted hhvm on mw1116 and mw1117 (got stuck, output of hhvm-dump-debug available) [14:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:51:03] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.285 second response time [14:51:12] RECOVERY - HHVM rendering on mw1117 is OK: HTTP OK: HTTP/1.1 200 OK - 66910 bytes in 1.358 second response time [14:52:14] RECOVERY - HHVM rendering on mw1116 is OK: HTTP OK: HTTP/1.1 200 OK - 66910 bytes in 2.025 second response time [14:52:46] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup Fundraising DB - https://phabricator.wikimedia.org/T136200#2326525 (10Cmjohnson) [14:52:57] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup Fundraising DB - https://phabricator.wikimedia.org/T136200#2326538 (10Cmjohnson) [14:53:12] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [14:53:46] (03PS1) 10Hashar: wmflib: allow require_package('g++') [puppet] - 10https://gerrit.wikimedia.org/r/290697 [14:54:03] RECOVERY - puppet last run on mw2068 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:54:45] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: db1056 BBU Failed - https://phabricator.wikimedia.org/T136136#2326540 (10Cmjohnson) 05Open>03Resolved Replaced the BBU appears to be okay. [14:55:32] (03CR) 10Hashar: "I have tried writing a spec for the function but that fails eventually with something like:" [puppet] - 10https://gerrit.wikimedia.org/r/290697 (owner: 10Hashar) [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. 
Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160525T1500). [15:00:04] mobrovac jan_drewniak: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:12] !log Bootstrap restbase1012-c.eqiad.wmnet : T134016 [15:00:14] T134016: RESTBase Cassandra cluster: Increase instance count from 2 to 3 - https://phabricator.wikimedia.org/T134016 [15:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:01:37] I can SWAT mobrovac jan_drewniak ping me when you're around [15:01:42] !log restbase cassandra truncated local_group_globaldomain_T_mathoid_svg.data [15:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:02:04] thcipriani: i'm here obviously :) [15:02:10] :) [15:02:44] (03PS2) 10Thcipriani: Math: Enable MathML on all wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290644 (https://phabricator.wikimedia.org/T131177) (owner: 10Mobrovac) [15:03:03] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290644 (https://phabricator.wikimedia.org/T131177) (owner: 10Mobrovac) [15:03:40] (03Merged) 10jenkins-bot: Math: Enable MathML on all wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290644 (https://phabricator.wikimedia.org/T131177) (owner: 10Mobrovac) [15:05:41] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:290644|Math: Enable MathML on all wikibooks]] (duration: 00m 29s) [15:05:44] ^ mobrovac check please [15:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:05:50] * mobrovac doing it [15:06:23] PROBLEM - cassandra-c CQL 10.64.32.204:9042 on restbase1012 is CRITICAL: Connection refused [15:07:02] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: rename holmium to labservices1002 - 
https://phabricator.wikimedia.org/T106303#2326565 (10Andrew) @Volans -- thanks! [15:08:14] thcipriani: will take another minute, sorry [15:08:21] no problem [15:11:13] thcipriani: mmm, doesn't seem to work [15:11:26] ah, let me try to purge the page, sec [15:11:55] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.32.204:9042 on restbase1012 is CRITICAL: Connection refused eevans Node is bootstrapping - The acknowledgement expires at: 2016-05-26 15:11:41. [15:13:11] thcipriani: kk, all good, false alarm [15:13:20] aka SNAFU :) [15:13:32] mobrovac: cool, thanks for checking :) [15:14:04] jan_drewniak: ping me when you're around and we can do the portals bump [15:15:45] thcipriani: sorry! Lost track of time. I'm available now if you are [15:15:59] jan_drewniak: yup [15:16:37] (03PS2) 10Thcipriani: T136019 updating portal stats. Bumping portals to master. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290683 (https://phabricator.wikimedia.org/T136019) (owner: 10Jdrewniak) [15:16:52] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290683 (https://phabricator.wikimedia.org/T136019) (owner: 10Jdrewniak) [15:17:51] (03PS4) 10Giuseppe Lavagetto: mediawiki: assign new eqiad appservers, install with jessie [puppet] - 10https://gerrit.wikimedia.org/r/290236 [15:17:54] (03Merged) 10jenkins-bot: T136019 updating portal stats. Bumping portals to master. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290683 (https://phabricator.wikimedia.org/T136019) (owner: 10Jdrewniak) [15:18:49] (03CR) 10Giuseppe Lavagetto: mediawiki: assign new eqiad appservers, install with jessie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290236 (owner: 10Giuseppe Lavagetto) [15:19:19] (03PS5) 10Giuseppe Lavagetto: mediawiki: assign new eqiad appservers, install with jessie [puppet] - 10https://gerrit.wikimedia.org/r/290236 [15:20:22] blerg. Could I get a root to chown -R mwdeploy:wikidev /srv/mediawiki-staging/.git/modules/portals/objects? 
[15:20:36] <_joe_> thcipriani: ack [15:20:59] <_joe_> thcipriani: done [15:21:04] _joe_: thank you! [15:21:27] <_joe_> thcipriani: btw, I have a scap bug to report, but maybe later/tomorrow; I am very busy atm [15:21:41] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: assign new eqiad appservers, install with jessie [puppet] - 10https://gerrit.wikimedia.org/r/290236 (owner: 10Giuseppe Lavagetto) [15:21:55] (03CR) 10Filippo Giunchedi: [C: 031] service::node: Prepare for scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/290490 (owner: 10Mobrovac) [15:22:14] _joe_: kk, I'll look out for it. Thank you. [15:23:22] PROBLEM - puppet last run on mw2017 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:23:57] <_joe_> this ^^ is me reenabling it [15:24:47] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: SWAT: [[gerrit:290683|T136019 updating portal stats.]] (duration: 00m 23s) [15:24:48] T136019: Please update HTML (wikipedia, wikibooks, etc.) project portals - https://phabricator.wikimedia.org/T136019 [15:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:25:15] !log thcipriani@tin Synchronized portals: SWAT: [[gerrit:290683|T136019 updating portal stats.]] (duration: 00m 26s) [15:25:16] T136019: Please update HTML (wikipedia, wikibooks, etc.) project portals - https://phabricator.wikimedia.org/T136019 [15:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:25:23] ^ jan_drewniak check please [15:25:32] RECOVERY - puppet last run on mw2017 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:25:47] _joe_: reenabling scap on mw2017? [15:26:00] thcipriani: thanks! 
[15:27:31] <_joe_> thcipriani: nope, puppet [15:27:53] <_joe_> !log imaging mw1261, with debian jessie [15:27:58] oh :) [15:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:18] 06Operations, 10ops-eqiad, 06DC-Ops: I/O issues for /dev/sdd on analytics1047.eqiad.wmnet - https://phabricator.wikimedia.org/T134056#2326636 (10Cmjohnson) Return part shipping information USPS 9202 3946 5301 2432 0845 53 FEDEX 9611918 2393026 53576230 [15:28:23] !log applying third schema change on x1 hosts T135699 [15:28:24] T135699: Schema changes for Echo moderation - https://phabricator.wikimedia.org/T135699 [15:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:29:36] 06Operations, 10Monitoring: Add RAID monitoring for HP servers - https://phabricator.wikimedia.org/T97998#1256586 (10Volans) @faidon would you agree to add in the meanwhile `hpssacli` to the list of installed packages so that at least we can do checks when needed manually or through salt? I saw that in `modul... [15:30:07] cmjohnson1: do you have a minute? [15:33:26] !log rolling restart of global text frontend memory caches for upsizing - 15 min spacing, ~8H to completion - T135384 [15:33:27] T135384: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384 [15:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:41] 06Operations, 10cassandra: Assign 'c' instance IPs for restbase100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T136206#2326678 (10Eevans) [15:35:40] 06Operations, 10cassandra: Assign 'c' instance IPs for restbase100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T136206#2326678 (10Eevans) [15:37:02] elukey: what's up? [15:38:00] super newbie question - I can't see the new disk, even after reboot. Any chance that I'd need to execute a horrible megacli command to expose the disk to the OS? 
[15:38:12] tried also to reboot and all the partitions shifted by one [15:38:58] elukey: yep horrible megacli commands...i believe you have to clear the cache [15:39:07] give me a sec..i have a note on it [15:41:53] (03PS1) 10Jcrespo: Revert "Depool db1056 for hardware maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290701 [15:43:13] elukey: check for foreign cfg with megacli [15:43:15] megacli -CfgForeign -Scan -a0 [15:43:30] cmjohnson1: they have a logical volume per disk, Virtual Drive 3 is missing and the disk is shown as Unconfigured(good) so I think it needs to be added to the logical volume ID3 [15:43:42] yes, that should help [15:43:55] dbproxy1004 is flapping a lot lately (not sure if it is the master or the backup) [15:44:09] elukey: you can clear it by using megacli -CfgForeign -Clear -a0 [15:45:15] then you will probably have to format as linux raid and add to array [15:46:26] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [15:46:36] oh....one other thing...before formatting you will need to add to jbod on raid controller...use megacli -PDMakeJBOD -PhysDrv\[32:2\] -a0 [15:46:54] 06Operations, 10Continuous-Integration-Config, 13Patch-For-Review: Switch CI from jsduck deb package to a gemfile/bundler system - https://phabricator.wikimedia.org/T109005#2326725 (10hashar) Given how jsduck is present in a wide range of jobs, I am just going to provision jsduck from ruby gems with somethin... [15:47:16] I think it is db1047 [15:47:21] 06Operations, 10Continuous-Integration-Config, 13Patch-For-Review: Switch CI from jsduck deb package to a gemfile/bundler system - https://phabricator.wikimedia.org/T109005#2326727 (10hashar) [15:47:40] cmjohnson1: do you mind adding this to https://phabricator.wikimedia.org/T134056?
so passive host failure [15:47:55] so I'll have a clear list of things, too messy in here :) [15:48:24] RECOVERY - haproxy failover on dbproxy1004 is OK: OK check_failover servers up 2 down 0 [15:49:14] LOL 300 queries, 661,655 seconds per query [15:49:27] *in the last hour* [15:49:54] someone is probably unintentionally DOSsing that host [15:50:29] 06Operations, 10ops-eqiad, 06DC-Ops: I/O issues for /dev/sdd on analytics1047.eqiad.wmnet - https://phabricator.wikimedia.org/T134056#2326754 (10Cmjohnson) megacli -CfgForeign -Scan -a0 There are 1 foreign configuration(s) on controller 0. @elukey this is what I have for notes on how to get the disk back... [15:50:41] jynus: rings a bell, let me check one thing [15:50:57] research, let me identify the user [15:51:28] I saw a 600k long query when you were on vacation, trying to remember where I have it [15:52:57] (03CR) 1020after4: "secret(): invalid secret keyholder/deploy_service" [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [15:53:43] hey, Wes just approved my NDA, https://phabricator.wikimedia.org/T134651 Can you give me access ?
[15:54:01] First, LDAP nda group [15:54:30] volans, I got the user on stat3 [15:54:48] oh, no, it is another user executing queries [15:54:53] !log puppet fails on gallium (Precise) due to E: Unable to locate package firejail [15:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:55:04] no clue why though [15:56:49] oh that is from mediawiki::packages::multimedia which is still installed on gallium (should not anymore) [15:56:50] jynus: I got one [15:57:04] twentyafterfour: I'm about to jump in a meeting but I'll be available later, afaics the key name in labs/private is servicedeploy not service_deploy [15:57:15] err, deploy_service, you get the idea [15:58:10] volans, I got it too [15:59:09] (03CR) 10Thcipriani: [C: 04-1] "Ran into a problem in beta cluster that is the same thing you got in the puppet compiler" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [15:59:40] 06Operations, 10Continuous-Integration-Config, 13Patch-For-Review: Switch CI from jsduck deb package to a gemfile/bundler system - https://phabricator.wikimedia.org/T109005#2326815 (10Jdforrester-WMF) 05Open>03Resolved a:03Jdforrester-WMF [15:59:46] 06Operations, 10Continuous-Integration-Config, 13Patch-For-Review: Switch CI from jsduck deb package to a gemfile/bundler system - https://phabricator.wikimedia.org/T109005#1537429 (10Jdforrester-WMF) a:05Jdforrester-WMF>03hashar [16:03:34] (03CR) 1020after4: [C: 031] Add "Lua" to syntax highlighting dropdown choices in Phab's "Paste" [puppet] - 10https://gerrit.wikimedia.org/r/290409 (https://phabricator.wikimedia.org/T100900) (owner: 10Aklapper) [16:07:44] (03PS1) 10Giuseppe Lavagetto: partman: add new recipe for sw-raid backed appservers [puppet] - 10https://gerrit.wikimedia.org/r/290707 [16:09:25] (03PS2) 10Giuseppe Lavagetto: partman: add new recipe for sw-raid backed appservers [puppet] - 
10https://gerrit.wikimedia.org/r/290707 [16:10:01] (03CR) 10Giuseppe Lavagetto: [C: 032] partman: add new recipe for sw-raid backed appservers [puppet] - 10https://gerrit.wikimedia.org/r/290707 (owner: 10Giuseppe Lavagetto) [16:10:16] (03CR) 10Giuseppe Lavagetto: [V: 032] partman: add new recipe for sw-raid backed appservers [puppet] - 10https://gerrit.wikimedia.org/r/290707 (owner: 10Giuseppe Lavagetto) [16:13:00] !log kill very long query on db1047 (ID 89274525, client disconnected) T136214 [16:13:00] T136214: Query from stat1003 brought down db1047 - https://phabricator.wikimedia.org/T136214 [16:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:13:19] !log reloading haproxy config to repool db1047 [16:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:23] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: Puppet has 1 failures [16:23:13] RECOVERY - RAID on db1023 is OK: OK: optimal, 1 logical, 2 physical [16:23:25] cool [16:26:22] 06Operations, 10ops-eqiad: db1023 Degraded RAID - https://phabricator.wikimedia.org/T135157#2327067 (10Volans) 05Open>03Resolved Rebuild completed, RAID back optimal. Thanks @Cmjohnson ! [16:28:23] (03PS2) 10Jcrespo: Revert "Depool db1056 for hardware maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290701 [16:28:37] (03CR) 10DCausse: [C: 04-1] "not to be merged now." [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/290679 (https://phabricator.wikimedia.org/T134937) (owner: 10DCausse) [16:32:12] 06Operations, 10Monitoring: Add RAID monitoring for HP servers - https://phabricator.wikimedia.org/T97998#2327092 (10faidon) Yes, definitely :) [16:34:56] 06Operations, 06Performance-Team, 06Reading-Web-Backlog, 10Traffic, 13Patch-For-Review: Vary mobile HTML by connection speed - https://phabricator.wikimedia.org/T119798#2327099 (10Jdlrobson) Should we decline this task @ori? 
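[Editor's note: the db1047 incident above — a single query at ~660k seconds dragging down the whole host — is the classic case for sweeping the processlist before deciding what to kill. A hypothetical sketch of that triage; the rows below are invented, in SHOW PROCESSLIST shape (id, user, time in seconds, query).]

```python
# Hypothetical triage helper: pick out queries running longer than a
# threshold from processlist-style rows, so the runaway (and its owner)
# can be identified before issuing a KILL.
def runaway_queries(processlist, max_seconds=3600):
    return [(qid, user, secs) for qid, user, secs, _ in processlist
            if secs > max_seconds]

rows = [
    (89274525, "research", 661655, "SELECT ..."),  # invented rows echoing the log
    (90000001, "wikiadmin", 12, "SELECT ..."),
]
print(runaway_queries(rows))
# -> [(89274525, 'research', 661655)]
```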
Doesn't feel like this will happen since we'd prefer to give the same experience... [16:37:33] 06Operations, 06Performance-Team, 07Availability: Audit mysql database class and hhvm binding support of SSL - https://phabricator.wikimedia.org/T136218#2327109 (10aaron) [16:38:46] (03CR) 10Jcrespo: [C: 032] Revert "Depool db1056 for hardware maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290701 (owner: 10Jcrespo) [16:39:27] (03CR) 10Dzahn: [C: 04-1] "the redirect itself looks ok, but the target URL does not work for me currently. http://wikimediapakistan.org just times out and i dont se" [puppet] - 10https://gerrit.wikimedia.org/r/286147 (https://phabricator.wikimedia.org/T56780) (owner: 10Dereckson) [16:40:36] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1056 after hardware maintenance (duration: 00m 23s) [16:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:41:43] (03CR) 10Dereckson: "You're right: it still worked this morning, when I asked Kelson to look for the lack of https, but doesn't work currently." [puppet] - 10https://gerrit.wikimedia.org/r/286147 (https://phabricator.wikimedia.org/T56780) (owner: 10Dereckson) [16:42:34] (03CR) 10Dzahn: "i am wondering if a phab ticket makes sense for this case where we need WMCH, Pakistan user group and ops on it" [puppet] - 10https://gerrit.wikimedia.org/r/286147 (https://phabricator.wikimedia.org/T56780) (owner: 10Dereckson) [16:43:43] (03CR) 1020after4: "filippo: can you move servicedeploy and servicedeploy.pub to deploy_service and deploy_service.pub?" 
[puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [16:44:09] (03CR) 10Dzahn: "well, https://phabricator.wikimedia.org/T56780 is linked and Saqib is CCed, so that" [puppet] - 10https://gerrit.wikimedia.org/r/286147 (https://phabricator.wikimedia.org/T56780) (owner: 10Dereckson) [16:44:43] RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:45:34] Dereckson: i think username "80686" is the admin :) [16:45:44] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2327156 (10jcrespo) [16:45:46] 06Operations, 06Performance-Team, 07Availability: Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#2327155 (10jcrespo) [16:46:39] 06Operations, 10RESTBase-Cassandra: Grafana bugginess; Graph scales sometimes off by an order of magnitude - https://phabricator.wikimedia.org/T121789#2327162 (10Eevans) >>! In T121789#2038406, @fgiunchedi wrote: > I think that has to do on how we aggregated `.count` metrics > > ``` > [count] > aggregationMet... [16:46:58] (03PS24) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [16:47:09] mutante: oh, I thought the admin was Manuel Schneider [16:47:29] (03CR) 10Alexandros Kosiaris: [C: 04-1] WIP - Create necessary folders for Postgresql and Cassandra (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/288215 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [16:48:03] Dereckson: https://phabricator.wikimedia.org/p/80686/ [16:48:25] Confirmed so. 
[16:48:40] IRC Nick root-80686 nice [16:48:53] needs "IRC network" field [16:48:59] 06Operations, 06Performance-Team, 07Availability: Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#2277856 (10jcrespo) I've put T111654 as a blocker, but in reality, cross datacenter writes using SSL should be already possible in all cases, as I can guarantee... [16:48:59] in profile pages [16:51:05] @seen root-80686 [16:51:05] mutante: I have never seen root-80686 [16:51:26] 06Operations, 06Performance-Team, 06Reading-Web-Backlog, 10Traffic, 13Patch-For-Review: Vary mobile HTML by connection speed - https://phabricator.wikimedia.org/T119798#2327193 (10atgo) @jdlrobson - please forgive me if I'm missing the discussion, but what's the rationale for providing the same experienc... [17:00:04] csteipp dapatrick: Dear anthropoid, the time has come. Please deploy Deploy Ex:OATHAuth to centralauth wikis - Task T107605 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160525T1700). [17:00:04] T107605: Support two-factor authentication on CentralAuth wikis - https://phabricator.wikimedia.org/T107605 [17:00:22] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6495769 keys - replication_delay is 0 [17:01:07] (03PS1) 10Volans: Monitoring: Install vendor specific RAID tool [puppet] - 10https://gerrit.wikimedia.org/r/290717 (https://phabricator.wikimedia.org/T97998) [17:01:25] (03CR) 10Alexandros Kosiaris: "20after4: Done" [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [17:02:28] dapatrick: Want to do the honors and +2 https://gerrit.wikimedia.org/r/#/c/290271/ ? 
[17:03:23] csteipp: I can only +1 ;( [17:03:28] (03CR) 10Dpatrick: [C: 031] Enable Ex:OATH on CentralAuth wikis, limited rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290271 (https://phabricator.wikimedia.org/T107605) (owner: 10CSteipp) [17:03:43] doh. We need to fix that.. [17:03:56] (03CR) 10CSteipp: [C: 032] Enable Ex:OATH on CentralAuth wikis, limited rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290271 (https://phabricator.wikimedia.org/T107605) (owner: 10CSteipp) [17:04:14] (03PS1) 10Alexandros Kosiaris: keyholder: service_deploy to deploy_service [labs/private] - 10https://gerrit.wikimedia.org/r/290718 [17:06:52] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] keyholder: service_deploy to deploy_service [labs/private] - 10https://gerrit.wikimedia.org/r/290718 (owner: 10Alexandros Kosiaris) [17:07:37] (03PS2) 10CSteipp: Enable Ex:OATH on CentralAuth wikis, limited rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290271 (https://phabricator.wikimedia.org/T107605) [17:10:57] (03CR) 10CSteipp: [C: 032] Enable Ex:OATH on CentralAuth wikis, limited rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290271 (https://phabricator.wikimedia.org/T107605) (owner: 10CSteipp) [17:11:34] (03Merged) 10jenkins-bot: Enable Ex:OATH on CentralAuth wikis, limited rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290271 (https://phabricator.wikimedia.org/T107605) (owner: 10CSteipp) [17:21:02] RECOVERY - RAID on db1034 is OK: OK: optimal, 1 logical, 2 physical [17:22:03] 06Operations, 10ops-eqiad, 06DC-Ops: I/O issues for /dev/sdd on analytics1047.eqiad.wmnet - https://phabricator.wikimedia.org/T134056#2327282 (10elukey) I checked the commands that @Cmjohnson provided and executed only the -PDMakeJBOD one since the previous two were a no-op (No foreign config found). Also dou... 
[17:22:20] 06Operations, 10ops-eqiad: db1034 degraded RAID - https://phabricator.wikimedia.org/T135353#2327283 (10Volans) 05Open>03Resolved Rebuild completed, array back optimal. Thanks @Cmjohnson [17:25:38] !log dpatrick@tin Synchronized wmf-config/InitialiseSettings.php: Enabling OATHAuth on CentralAuth wikis (duration: 00m 24s) [17:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:27:49] (03PS5) 10Volans: Cleanup my.cnf by grouping options [puppet] - 10https://gerrit.wikimedia.org/r/286858 (owner: 10Jcrespo) [17:35:19] (03CR) 10Krinkle: [C: 031] "Works for me :) (Run manually from terbium)." [puppet] - 10https://gerrit.wikimedia.org/r/283107 (owner: 10EBernhardson) [17:38:08] (03CR) 10Jcrespo: [C: 031] Cleanup my.cnf by grouping options [puppet] - 10https://gerrit.wikimedia.org/r/286858 (owner: 10Jcrespo) [17:39:55] andrewbogott: a puppet run on labtestweb2001 would be nice, to get it to assign the v6 IP, but if that breaks your tests then it can wait [17:40:13] no, that's fine, I'll enable it [17:40:22] :) thx [17:41:40] (03PS2) 10Dzahn: add AAAA record for labtestweb2001 [dns] - 10https://gerrit.wikimedia.org/r/290542 [17:42:11] yep, i saw it add it. cool [17:43:53] and.. yes, good that we did it this way, need to amend [17:45:48] no, it's right 2620:0:861:1:208:80:153:14 [17:47:44] (03PS3) 10Dzahn: add AAAA record for labtestweb2001 [dns] - 10https://gerrit.wikimedia.org/r/290542 [17:49:14] (03CR) 10Dzahn: [C: 032] "root@labtestweb2001:~# ip a s | grep inet6 | grep global" [dns] - 10https://gerrit.wikimedia.org/r/290542 (owner: 10Dzahn) [17:49:58] uhmm.. unmerged DNS changes that are before mine [17:50:05] related to es servers [17:50:30] i mean, merged but authdns-update did not run yet [17:51:18] robh: decom for es ^ ? 
[17:52:11] (03PS2) 10Aaron Schulz: Fix slave lag wait calls for read-only ES clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289575 (https://phabricator.wikimedia.org/T135690) [17:54:49] (03CR) 10Aaron Schulz: [C: 032] Fix slave lag wait calls for read-only ES clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289575 (https://phabricator.wikimedia.org/T135690) (owner: 10Aaron Schulz) [17:56:02] !log rolling restart of global text frontend memory caches for upsizing - reduce spacing to 5 mins, will finish notably faster - T135384 [17:56:03] T135384: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384 [17:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:57:08] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T134755#2275897 (10Dzahn) when making an unrelated DNS change, "authdns-update" told me there are pending changes that remove these hosts from DNS. I have not touched... 
[17:57:10] (03PS3) 10Aaron Schulz: Fix slave lag wait calls for read-only ES clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289575 (https://phabricator.wikimedia.org/T135690) [17:58:06] (03CR) 10Aaron Schulz: [C: 032] Fix slave lag wait calls for read-only ES clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289575 (https://phabricator.wikimedia.org/T135690) (owner: 10Aaron Schulz) [17:58:43] (03Merged) 10jenkins-bot: Fix slave lag wait calls for read-only ES clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289575 (https://phabricator.wikimedia.org/T135690) (owner: 10Aaron Schulz) [17:58:44] PROBLEM - puppet last run on es2002 is CRITICAL: CRITICAL: Puppet has 1 failures [17:59:02] mutante: sorry, yes [17:59:07] if its 2005-2010 [17:59:59] !log aaron@tin Started scap: file wmf-config/db-codfw.php Fix slave lag wait calls for read-only ES clusters [18:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:00:36] robh: yes, it is 2005 through 2010, i'll merge them both [18:01:39] thx [18:02:52] !log aaron@tin scap aborted: file wmf-config/db-codfw.php Fix slave lag wait calls for read-only ES clusters (duration: 02m 53s) [18:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:03:07] (03CR) 10Dzahn: "dig AAAA labtestwikitech.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/290542 (owner: 10Dzahn) [18:03:27] !log aaron@tin Synchronized wmf-config/db-codfw.php: Fix slave lag wait calls for read-only ES clusters (duration: 00m 23s) [18:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:03:59] (03PS3) 10Dzahn: ircecho: set user/group/mode for unit file [puppet] - 10https://gerrit.wikimedia.org/r/290619 [18:04:06] !log aaron@tin Synchronized wmf-config/db-eqiad.php: Fix slave lag wait calls for read-only ES clusters (duration: 00m 27s) [18:04:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:05:21] (03CR) 10Dzahn: [C: 032] ircecho: set user/group/mode for unit file [puppet] - 10https://gerrit.wikimedia.org/r/290619 (owner: 10Dzahn) [18:06:25] (03CR) 10Dzahn: "all other unit files are root owned, this was "998:998"" [puppet] - 10https://gerrit.wikimedia.org/r/290619 (owner: 10Dzahn) [18:07:51] (03PS2) 1020after4: Add "Lua" to syntax highlighting dropdown choices in Phab's "Paste" [puppet] - 10https://gerrit.wikimedia.org/r/290409 (https://phabricator.wikimedia.org/T100900) (owner: 10Aklapper) [18:08:46] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T134755#2327361 (10jcrespo) but I have not yet committed: gerrit:287645 Maybe robh did? [18:09:45] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T134755#2327362 (10RobH) There was a pending change by me to remove es2005-es2010 as they are ready for wipe. We fixed via irc chat. [18:10:15] akosiaris: I get "secret(): invalid secret keyholder/dumps" when running puppet compiler on https://gerrit.wikimedia.org/r/#/c/289236/ [18:10:34] akosiaris: what is the name of the key file for the db dump script? [18:10:55] twentyafterfour: we can usually fix that by adding fake secrets in labs/private [18:11:21] mutante: yes but I'm preparing for production deployment so I need it to be correct in prod [18:11:43] right, ok, want me to take a look at private repo? [18:11:57] yeah in secrets/keyholder [18:12:08] there should be a db dump key I don't know if it's named wrong or missing [18:12:11] hm, I don't see something in labs private [18:12:21] "dumpsdeploy" ? [18:12:40] that's a key there [18:12:51] twentyafterfour: the names should match the keyholder key names on tin, so yeah what mutante said 'dumpsdeploy' [18:12:55] that's it .. 
[18:12:56] yeah, but it's missing from the labs/private repo [18:13:03] modules/secret/secrets/keyholder# file dumpsdeploy [18:13:04] dumpsdeploy: PEM RSA private key [18:13:09] akosiaris: you're right I'll fix labs [18:13:11] "should", I should probably have matched what you put in labs/private instead [18:13:27] godog: that's fine [18:13:45] matching the key names on tin is correct, there were just some keys missing on labs I guess [18:14:07] (03PS1) 10Muehlenhoff: Only require firejail on trusty/jessie [puppet] - 10https://gerrit.wikimedia.org/r/290723 [18:14:22] the name is defined in https://gerrit.wikimedia.org/r/#/c/289236/24/hieradata/common/scap/server.yaml,cm [18:14:33] yeah that's a real pain, there's no manifest of what keys/files should be in private.git anywhere [18:14:34] (03PS1) 10Yuvipanda: tools: Turn on puppet nag mails [puppet] - 10https://gerrit.wikimedia.org/r/290724 (https://phabricator.wikimedia.org/T136167) [18:14:34] and in scap::target resources [18:14:44] andrewbogott: chasemp valhallasw`cloud can you +1 ^ [18:14:59] (03PS2) 10Rush: tools: Turn on puppet nag mails [puppet] - 10https://gerrit.wikimedia.org/r/290724 (https://phabricator.wikimedia.org/T136167) (owner: 10Yuvipanda) [18:15:00] godog: the list in hieradata is what defines the names [18:15:04] is this a per host email or per project w/ list of hosts? [18:15:18] per host [18:15:40] (03CR) 10Andrew Bogott: [C: 031] "We're going to regret this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/290724 (https://phabricator.wikimedia.org/T136167) (owner: 10Yuvipanda) [18:15:45] alternatively, someone could actively check http://shinken.wmflabs.org/problems ;-) [18:15:45] ugh that's painful but I guess if everyone is doing it then that may prompt us to fix it [18:16:02] twentyafterfour: for scap indeed, not for secret, well except the pupept code of course [18:16:02] chasemp: it's only if puppet has failed for 24 hours [18:16:13] so transient downtime won't email, only for-real breakage [18:16:29] I'm willing to give it a whirl and see how it goes [18:16:37] however, given that shinken is currently so noisy it's useless, it's probably fine [18:16:43] twentyafterfour: want the config change for Lua syntax highlighting now? [18:16:54] (03CR) 10Rush: [C: 031] "sure let's try it" [puppet] - 10https://gerrit.wikimedia.org/r/290724 (https://phabricator.wikimedia.org/T136167) (owner: 10Yuvipanda) [18:17:47] valhallasw`cloud: you wanna +1 too? :D [18:18:03] mutante: at your convenience [18:18:27] (03CR) 10Merlijn van Deen: "+0 (we should really make shinken alerts less noisy and more visible instead)" [puppet] - 10https://gerrit.wikimedia.org/r/290724 (https://phabricator.wikimedia.org/T136167) (owner: 10Yuvipanda) [18:18:38] phab config changes like that should roll out gracefully [18:19:03] apache restart may be needed to put it into effect [18:19:05] (03CR) 10Dzahn: [C: 032] Add "Lua" to syntax highlighting dropdown choices in Phab's "Paste" [puppet] - 10https://gerrit.wikimedia.org/r/290409 (https://phabricator.wikimedia.org/T100900) (owner: 10Aklapper) [18:19:09] (03CR) 10Yuvipanda: [C: 032] "I agree, but $TIME unfortunately..." [puppet] - 10https://gerrit.wikimedia.org/r/290724 (https://phabricator.wikimedia.org/T136167) (owner: 10Yuvipanda) [18:19:18] twentyafterfour: ok! 
thx [18:20:49] (03PS3) 10Yuvipanda: tools: Turn on puppet nag mails [puppet] - 10https://gerrit.wikimedia.org/r/290724 (https://phabricator.wikimedia.org/T136167) [18:20:57] (03CR) 10Yuvipanda: [V: 032] tools: Turn on puppet nag mails [puppet] - 10https://gerrit.wikimedia.org/r/290724 (https://phabricator.wikimedia.org/T136167) (owner: 10Yuvipanda) [18:20:58] godog: maybe usage of secret() should be more formalized like it is with keyholder (after my patch) [18:21:40] it's definitely confusing - secret() is at least an improvement over haphazard puppet:/// urls scattered about all over puppet code [18:25:00] !log restarting varnish-frontend on cp3048 with lg_dirty_multi:6 unpuppetized - T135384 [18:25:01] T135384: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384 [18:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:25:09] RECOVERY - puppet last run on es2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:28:53] twentyafterfour: it works without restarting [18:30:23] twentyafterfour: yeah an improvement for sure, the puppet:/// thing also means all machines see all secrets via the fileserver (!) 
anyways yeah basically having a way to regenerate private material, then one copy is public/dummy the other isn't [18:30:55] (03CR) 10Alexandros Kosiaris: [C: 032] ores: Use the new service::uwsgi define [puppet] - 10https://gerrit.wikimedia.org/r/288618 (owner: 10Alexandros Kosiaris) [18:31:12] (03PS4) 10Alexandros Kosiaris: ores: Use the new service::uwsgi define [puppet] - 10https://gerrit.wikimedia.org/r/288618 [18:31:20] (03CR) 10Alexandros Kosiaris: [V: 032] ores: Use the new service::uwsgi define [puppet] - 10https://gerrit.wikimedia.org/r/288618 (owner: 10Alexandros Kosiaris) [18:37:56] 06Operations, 10Monitoring, 07Icinga: re-create script for manual paging - https://phabricator.wikimedia.org/T82937#2327423 (10Dzahn) a:03Dzahn [18:38:04] 06Operations, 10Monitoring, 07Icinga: re-create script for manual paging - https://phabricator.wikimedia.org/T82937#907272 (10Dzahn) p:05Normal>03Low [18:39:42] 06Operations, 06Security-Team, 10Wikimedia-General-or-Unknown: Non-NDA users cannot access graphite.wikimedia.org - https://phabricator.wikimedia.org/T56713#2327427 (10Dzahn) @metatron when looking at T108546 would you consider this ticket a duplicate ? [18:39:59] (03PS25) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [18:40:36] (03PS1) 10Alexandros Kosiaris: Remove non ASCII character from scap.pp [puppet] - 10https://gerrit.wikimedia.org/r/290728 [18:40:49] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Remove non ASCII character from scap.pp [puppet] - 10https://gerrit.wikimedia.org/r/290728 (owner: 10Alexandros Kosiaris) [18:44:37] moritzm: the firejail change affected host gallium [18:44:53] cant find the package there [18:45:34] akosiaris: I am marking service::deploy::scap as deprecated because it just wraps scap::target without adding any value (and it's only used in one place) ... keyholder patch amended to include that change [18:45:36] i see your gerrit change!.. 
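The "secret(): invalid secret keyholder/dumps" failure above was a naming mismatch: the key names declared in hieradata (`scap/server.yaml`) have to match the key files under `modules/secret/secrets/keyholder/` in the private repo, and the log notes there is "no manifest of what keys/files should be in private.git". A minimal sketch of the kind of consistency check that would catch this, with the directory layout taken from the log and the function name and call shape being illustrative assumptions:

```python
from pathlib import Path

def missing_keyholder_keys(declared_keys, private_repo):
    """Return declared keyholder key names with no key file in the repo.

    declared_keys: key names as they appear in hieradata (e.g. 'dumpsdeploy');
    private_repo: path to a checkout laid out like the private repo, with
    keys under modules/secret/secrets/keyholder/ (path as shown in the log).
    """
    keydir = Path(private_repo) / "modules" / "secret" / "secrets" / "keyholder"
    # Collect the key files actually present; an absent directory means none.
    present = {p.name for p in keydir.iterdir()} if keydir.is_dir() else set()
    return sorted(k for k in declared_keys if k not in present)
```

Run against both the production private repo and labs/private, this would have flagged `dumpsdeploy` as missing from labs/private before the puppet compiler did.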
[18:45:48] (03PS26) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [18:46:13] (03CR) 10Dzahn: [C: 031] "happened to see that on icinga, yea, fails on gallium but that is like the only one" [puppet] - 10https://gerrit.wikimedia.org/r/290723 (owner: 10Muehlenhoff) [18:46:21] twentyafterfour: ok. thanks for the heads up [18:47:27] ACKNOWLEDGEMENT - puppet last run on gallium is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn https://gerrit.wikimedia.org/r/#/c/290723/1 [18:48:26] ACKNOWLEDGEMENT - Hadoop NodeManager on analytics1047 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager daniel_zahn host in downtime [18:49:16] ACKNOWLEDGEMENT - tilerator on maps2001 is CRITICAL: Connection refused daniel_zahn hosts in downtime [18:49:16] ACKNOWLEDGEMENT - tileratorui on maps2001 is CRITICAL: Connection refused daniel_zahn hosts in downtime [18:49:16] ACKNOWLEDGEMENT - tilerator on maps2002 is CRITICAL: Connection refused daniel_zahn hosts in downtime [18:49:16] ACKNOWLEDGEMENT - tileratorui on maps2002 is CRITICAL: Connection refused daniel_zahn hosts in downtime [18:49:16] ACKNOWLEDGEMENT - tilerator on maps2003 is CRITICAL: Connection refused daniel_zahn hosts in downtime [18:49:16] ACKNOWLEDGEMENT - tileratorui on maps2003 is CRITICAL: Connection refused daniel_zahn hosts in downtime [18:49:17] ACKNOWLEDGEMENT - tilerator on maps2004 is CRITICAL: Connection refused daniel_zahn hosts in downtime [18:49:17] ACKNOWLEDGEMENT - tileratorui on maps2004 is CRITICAL: Connection refused daniel_zahn hosts in downtime [18:49:42] mutante: see backscroll, already made a patch, but need feedback from Antoine: https://gerrit.wikimedia.org/r/290723 [18:50:27] moritzm: yes, i saw it and +1'd , thanks [18:51:01] feel free to merge, I'm AFK for dinner in a few minutes [18:51:13] otherwise I'll do it tomorrow morning [18:51:17] 
ACKNOWLEDGEMENT - puppet last run on maps2001 is CRITICAL: CRITICAL: Puppet has 2 failures daniel_zahn hosts in downtime [18:51:17] ACKNOWLEDGEMENT - puppet last run on maps2002 is CRITICAL: CRITICAL: Puppet has 2 failures daniel_zahn hosts in downtime [18:51:17] ACKNOWLEDGEMENT - puppet last run on maps2003 is CRITICAL: CRITICAL: Puppet has 2 failures daniel_zahn hosts in downtime [18:51:17] ACKNOWLEDGEMENT - puppet last run on maps2004 is CRITICAL: CRITICAL: Puppet has 2 failures daniel_zahn hosts in downtime [18:51:46] (03PS2) 10Dzahn: Only require firejail on trusty/jessie [puppet] - 10https://gerrit.wikimedia.org/r/290723 (owner: 10Muehlenhoff) [18:52:19] (03PS27) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [18:52:25] i'll do it, we know the list of precise hosts and no other was trying to install it per icinga [18:55:30] Hi [18:55:37] Reproting a replication glitch - https://quarry.wmflabs.org/query/9257 [18:55:44] *Reporting [18:56:12] These files show up on an API enquiry but aren't on production apparently... [18:56:25] (03PS28) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [18:56:45] hmm.. or not.. now i see changes in compiler that i did not expect to see [18:57:18] but that's probably because require_package vs. 
just package [18:57:44] (03CR) 10Dzahn: "compiler run on a trusty imagescaler: http://puppet-compiler.wmflabs.org/2914/mw1153.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/290723 (owner: 10Muehlenhoff) [18:58:14] !log restbase running a partial mobile-sections dump of eswiki for T135571 on restbase1009 [18:58:15] T135571: [BUG] [Content Service] Tapping random causes an unknown error sometimes - https://phabricator.wikimedia.org/T135571 [18:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:58:46] (03CR) 10Dzahn: "Only in old:" [puppet] - 10https://gerrit.wikimedia.org/r/290723 (owner: 10Muehlenhoff) [19:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160525T1900). [19:01:23] dear jouncebot: doing the needful. [19:04:05] (03CR) 10Hashar: [C: 031] "That looks all fine :-}" [puppet] - 10https://gerrit.wikimedia.org/r/290723 (owner: 10Muehlenhoff) [19:04:26] (03PS1) 1020after4: group1 wikis to 1.28.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290731 [19:04:44] (03CR) 1020after4: [C: 032] group1 wikis to 1.28.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290731 (owner: 1020after4) [19:05:26] (03Merged) 10jenkins-bot: group1 wikis to 1.28.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290731 (owner: 1020after4) [19:07:09] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.28.0-wmf.3 [19:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:07:52] (03PS1) 10Ladsgroup: ores: Add watchdog check to precached systemd [puppet] - 10https://gerrit.wikimedia.org/r/290733 [19:08:11] Crap [19:08:30] aude: audephone: ping [19:08:39] Hi hoo [19:08:43] :( [19:08:46] Is there a probpem [19:09:04] Labels just disappeared [19:09:10] Omg [19:09:27] On test wikidata too [19:09:29] :( 
[19:09:46] I would put wikidata back on wmf2 [19:09:49] There's a replication issue I'm told [19:10:29] No one's considered a lock down? [19:10:31] And if we can investigate on test wikidata [19:10:38] ok [19:10:42] I can roll back [19:10:45] What's happening people? [19:11:09] ShakespeareFan00: labs replication is not a production issue. [19:11:10] ShakespeareFan00: I don't know anything about it? [19:11:28] What are labels anyway? [19:11:35] (03PS1) 10Alexandros Kosiaris: ores: Specify uwsgi plugins to override the default [puppet] - 10https://gerrit.wikimedia.org/r/290734 [19:11:55] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] ores: Specify uwsgi plugins to override the default [puppet] - 10https://gerrit.wikimedia.org/r/290734 (owner: 10Alexandros Kosiaris) [19:11:56] audephone: That's because of what lego did to linker almost certainly [19:12:11] Hoo ok [19:12:27] Pretty sure kitten item was ok [19:12:31] (03PS1) 1020after4: wikidata back to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290735 [19:12:33] Me checks [19:12:52] so rollback? [19:13:07] wikidatawiki only? [19:13:42] twentyafterfour I think so [19:13:51] (03CR) 1020after4: [C: 032] wikidata back to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290735 (owner: 1020after4) [19:14:09] Though I'll need to find something affected on test wikidata [19:14:14] What broke? [19:14:27] (03Merged) 10jenkins-bot: wikidata back to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290735 (owner: 1020after4) [19:14:37] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: roll back wikidata to wmf.2 [19:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:15:00] hoo, audephone: that fixed it? 
[19:15:14] yeah [19:17:21] btw I now get SMS whenever the train is deployed :) [19:18:07] I can try to investigate later and hope it's an easy fix [19:18:23] audephone: Think I found the problem [19:18:31] Hoo yay [19:19:56] I'm seeing a lot of "sessions are disabled for this entry point" now from wmf.3 [19:20:07] php-1.28.0-wmf.3/includes/session/SessionManager.php [19:20:35] audephone: I don't see a nice way to fix it, though [19:20:44] :( [19:22:48] audephone: I'm wrong, think there is [19:23:05] Hope so [19:24:52] Got it [19:25:11] :) [19:25:37] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [19:27:29] audephone: legoktm: twentyafterfour: https://gerrit.wikimedia.org/r/290737 [19:27:48] Looking [19:28:23] thanks :) [19:29:09] (03CR) 10Alexandros Kosiaris: [C: 031] ores: Add watchdog check to precached systemd [puppet] - 10https://gerrit.wikimedia.org/r/290733 (owner: 10Ladsgroup) [19:29:21] If Lego is around would be better if he reviews [19:29:33] Yeah, would be nice [19:29:52] If I get home in a bit and it's not merged, then can look again [19:30:00] audephone: On the ferry? [19:30:21] Next week [19:30:31] Oh, right [19:30:34] * hoo confused [19:31:04] !log Start cleanup of restbase2001-b.codfw.wmnet : T1340116 [19:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:07] is legoktm in the house? 
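The single-wiki rollback above works because the train is just a per-wiki mapping: each dbname points at a deployed branch directory, so wikidatawiki can be repointed at wmf.2 while the rest of group1 stays on wmf.3, followed by a wikiversions rebuild and sync. A toy model of that mapping (the real workflow edits wikiversions on tin via scap; the function and the two-entry map here are illustrative):

```python
def roll_back(wikiversions, dbname, branch):
    """Return a copy of the wiki -> branch mapping with one wiki repointed."""
    if dbname not in wikiversions:
        raise KeyError("unknown wiki: " + dbname)
    updated = dict(wikiversions)
    updated[dbname] = branch
    return updated

# group1 after the train moved to wmf.3 (only two wikis shown)
versions = {
    "wikidatawiki": "php-1.28.0-wmf.3",
    "commonswiki": "php-1.28.0-wmf.3",
}
# roll wikidatawiki alone back to the previous branch
versions = roll_back(versions, "wikidatawiki", "php-1.28.0-wmf.2")
```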
[19:32:35] (03CR) 10Alexandros Kosiaris: [C: 032] ores: Add watchdog check to precached systemd [puppet] - 10https://gerrit.wikimedia.org/r/290733 (owner: 10Ladsgroup) [19:32:39] (03PS2) 10Alexandros Kosiaris: ores: Add watchdog check to precached systemd [puppet] - 10https://gerrit.wikimedia.org/r/290733 (owner: 10Ladsgroup) [19:32:44] (03CR) 10Alexandros Kosiaris: [V: 032] ores: Add watchdog check to precached systemd [puppet] - 10https://gerrit.wikimedia.org/r/290733 (owner: 10Ladsgroup) [19:34:25] twentyafterfour: that happens when someone does session handling in a load.php or similar call [19:34:32] got a stack trace for it? [19:35:21] 06Operations, 10Continuous-Integration-Config, 13Patch-For-Review: Switch CI from jsduck deb package to a gemfile/bundler system - https://phabricator.wikimedia.org/T109005#2327620 (10hashar) Trusty: JSDuck 5.3.4 (Ruby 1.9.3) Jessie: JSDuck 5.3.4 (Ruby 2.1.5) [19:38:53] tgr: https://phabricator.wikimedia.org/P3178 I think it's the same bug as https://phabricator.wikimedia.org/T136124 [19:40:00] jouncebot: net [19:40:03] jouncebot: next [19:40:03] In 0 hour(s) and 19 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160525T2000) [19:40:37] Mutante still didn't get rcstream working for me [19:40:44] (03PS2) 10Dzahn: l10nupdate: Stop using deprecated refreshCdbJsonFiles script [puppet] - 10https://gerrit.wikimedia.org/r/289886 (owner: 10BryanDavis) [19:40:49] twentyafterfour: it is [19:40:59] (03CR) 10Dzahn: [C: 032] l10nupdate: Stop using deprecated refreshCdbJsonFiles script [puppet] - 10https://gerrit.wikimedia.org/r/289886 (owner: 10BryanDavis) [19:41:02] For now I am scraping the irc feed [19:41:33] Not sure I have access such as to rcs1001 to debug [19:41:58] if redis is getting the publish events [19:43:50] audephone: yes,sorry i dont really know at this point. 
debugged and fixed the firewall, i can confirm this; [19:43:58] root@rcs1001:~# tcpdump port 6379 and src silver.wikimedia.org [19:44:13] 19:43:50.236112 IP6 silver.wikimedia.org.55688 > 2620:0:861:103:10:64:32:148.6379: ... [19:44:29] so i dont know why rcstream itself isnt picking it up [19:44:36] I would try redis-cli monitor and grep for wikitech [19:44:54] To see if it is getting anything [19:45:22] Obviously I am in mobile right now and it's not urgent [19:45:48] (03PS1) 10Ladsgroup: ores: Remove verbose and increase timeout for watchdog [puppet] - 10https://gerrit.wikimedia.org/r/290740 [19:46:04] There is getting stuff to redis which is directly from mediawiki [19:46:22] The rcstream does subscribe part [19:46:32] audephone: it does not, nothing with redis-cli monitor | grep wikitech [19:46:50] Even when making an edit [19:46:53] yes [19:47:12] Then it's somewhere between mediawiki and redis [19:47:40] (03CR) 10Alexandros Kosiaris: [C: 032] ores: Remove verbose and increase timeout for watchdog [puppet] - 10https://gerrit.wikimedia.org/r/290740 (owner: 10Ladsgroup) [19:47:58] Or maybe it uses the dbname [19:48:20] Grep labs or labswiki? [19:48:44] though subscribing to those didn't work [19:49:24] audephone: looked like it first, but it was just edits on enwiki that somehow contain the word "labs" [19:49:31] Oh [19:50:25] will have to look at the configuration and figure out how to debug from mediawiki or such [19:51:05] scraping irc is okay enough for now though is a bit ugly [19:51:28] yea, but you are also not the only one [19:51:50] i'm looking at more hints on rcs1001 [19:52:04] the redis code is not complicated [19:53:26] In php it is $redis = new Redis(); [19:53:56] does rcstream need to know the list of wikis? [19:54:02] or does it not care at all [19:54:26] Then redis connect with the URL and Port [19:54:38] I don't think it needs to know [19:55:04] because i was looking for a list of wikis or db names or so.. 
and no doesnt look like there is [19:55:11] a ticket would be good at this point [19:55:17] Redis connect should return true or false depending if it can connect [19:55:38] well, redis is running [19:55:48] i see that, just not data from wikitech [19:55:58] Hmmm [19:56:13] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/1: down - Core: cr2-eqiad:xe-5/2/3 (Zayo, OGYX/120003//ZYO, 36ms) {#11519} [10Gbps wave]BR [19:58:32] ok, I should probably go home soon [19:59:28] audephone: it's somewhere in rcs1001 itself but before it gets to redis, i also need to go, but we can continue another time [20:00:02] ok, thanks [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160525T2000). Please do the needful. [20:00:22] i can definitely see packets coming in in the moment i make edits [20:00:25] ttyl! [20:00:31] no mobileapps deploy today [20:01:04] ok [20:02:13] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2327874 (10Eevans) I think the conversion is technically complete; There are more instances to bootstrap, but... 
[20:03:34] 06Operations, 10cassandra: Grafana bugginess; Graph scales sometimes off by an order of magnitude - https://phabricator.wikimedia.org/T121789#2327879 (10Eevans) [20:05:18] (03PS1) 10Mobrovac: Change Prop: Purge RESTBase re-renders [puppet] - 10https://gerrit.wikimedia.org/r/290748 [20:07:08] 06Operations, 10Parsoid, 06Services: Switch Parsoid to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T125017#2327897 (10hashar) [20:07:44] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.112, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [20:07:54] PROBLEM - Restbase root url on restbase1010 is CRITICAL: Connection refused [20:08:44] 06Operations, 06Services, 07Tracking: Move Node.JS services to Jessie and Node 4 (tracking) - https://phabricator.wikimedia.org/T124989#2327904 (10hashar) [20:10:41] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: connect usb external disk to labmon1001 - https://phabricator.wikimedia.org/T136242#2327910 (10RobH) [20:11:53] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [20:11:54] RECOVERY - Restbase root url on restbase1010 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.009 second response time [20:15:55] (03PS2) 10Mobrovac: Change Prop: Purge RESTBase re-renders [puppet] - 10https://gerrit.wikimedia.org/r/290748 [20:23:12] (03PS2) 10Gehel: WIP experiments, just keeping that safe somewhere... [puppet] - 10https://gerrit.wikimedia.org/r/290631 [20:24:13] (03Abandoned) 10Gehel: WIP experiments, just keeping that safe somewhere... 
[puppet] - 10https://gerrit.wikimedia.org/r/290631 (owner: 10Gehel) [20:33:37] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup Fundraising DB - https://phabricator.wikimedia.org/T136200#2327972 (10Cmjohnson) @Jgreen We are out of available SRX switch ports to add this new db....do you want to decom anything? [20:34:12] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review: eqiad: Rack and setup new labstore - https://phabricator.wikimedia.org/T133397#2327973 (10Cmjohnson) 05Open>03Resolved This has been completed. [20:50:03] (03CR) 10Mobrovac: "PCC - https://puppet-compiler.wmflabs.org/2916/" [puppet] - 10https://gerrit.wikimedia.org/r/290748 (owner: 10Mobrovac) [20:56:39] (03PS1) 10Mobrovac: RESTBase: Remove purging config [puppet] - 10https://gerrit.wikimedia.org/r/290786 [20:58:01] 06Operations, 06Labs, 10Labs-Infrastructure: rcstream not working for wikitech wiki - https://phabricator.wikimedia.org/T136245#2328038 (10Dzahn) [20:59:15] twentyafterfour: here now. Are you still deploying? https://gerrit.wikimedia.org/r/#/c/290789/ is the cherry-pick that needs to be deployed [21:00:39] otherwise I can sync it out [21:01:06] (03CR) 10Mobrovac: "PCC - https://puppet-compiler.wmflabs.org/2917/" [puppet] - 10https://gerrit.wikimedia.org/r/290786 (owner: 10Mobrovac) [21:02:13] Afterwards wikidata can be re-updated AFAICT [21:03:07] 06Operations, 06Labs, 10Labs-Infrastructure: rcstream not working for wikitech wiki - https://phabricator.wikimedia.org/T136245#2328038 (10Krinkle) RCStream doesn't use channels (unlike the RC messages we send over IRCD, though even there IRCD auto-creates any channels messages are addressed at). It's one large "... [21:04:07] (03CR) 10Ppchelko: [C: 031] Change Prop: Purge RESTBase re-renders [puppet] - 10https://gerrit.wikimedia.org/r/290748 (owner: 10Mobrovac) [21:04:45] mutante: Did you check tcpdump on both sides? (e.g.
also incoming on the redis machine) [21:05:25] Krinkle: actually only incoming on the rcs1001 where redis runs [21:05:41] and when i make an edit on wikitech, [21:05:46] i can see stuff come in [21:06:38] but in redis i cant see it [21:08:05] and then somehow it would have to be a thing that is different for wikitech because it's not like the other cluster wikis [21:08:19] but MW config looks the same to me when i checked [21:10:15] (03PS2) 10Yuvipanda: Add base PHP container & php web container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/290607 [21:10:32] it seemed so obvious that it was the firewalling issue before.. but it's open now [21:10:46] (03PS1) 10Eevans: enable instance restbase2005-b [puppet] - 10https://gerrit.wikimedia.org/r/290792 (https://phabricator.wikimedia.org/T134016) [21:10:52] (03PS3) 10Yuvipanda: Add base PHP container & php web container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/290607 [21:11:02] (03CR) 10jenkins-bot: [V: 04-1] RESTBase: Remove purging config [puppet] - 10https://gerrit.wikimedia.org/r/290786 (owner: 10Mobrovac) [21:11:12] legoktm: I can [21:11:33] legoktm: also, https://phabricator.wikimedia.org/T136124 [21:11:47] (03PS1) 10Yuvipanda: Add a simple builder script [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/290793 [21:12:03] valhallasw`cloud: bd808 ^ docker builder script helper for building a set of dockerfiles. [21:12:06] in python [21:12:06] I guess you pinged me about the same thing I pinged you about. 
nice [21:12:08] deploying [21:12:14] I thought of building it with a makefile and then realized 'NO' [21:12:51] (03CR) 10Ppchelko: [C: 031] RESTBase: Remove purging config [puppet] - 10https://gerrit.wikimedia.org/r/290786 (owner: 10Mobrovac) [21:14:20] (03PS1) 10Yuvipanda: Switch to using wikimedia-jessie as base container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/290795 [21:14:29] 06Operations, 10cassandra: Assign 'c' instance IPs for restbase100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T136206#2326678 (10Dzahn) looks like this is already done meanwhile restbase1007-c.eqiad.wmnet has address 10.64.0.232 restbase1008-c.eqiad.wmnet has address 10.64.32.196 restbase1009-c.eq... [21:15:34] 06Operations, 10ops-codfw, 10RESTBase: plug in restbase2004 network cable - https://phabricator.wikimedia.org/T134197#2328162 (10RobH) 05Open>03Resolved I didn't get back to this task until today, but the system appears online and working (all green in icinga.) [21:18:31] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to restricted and analytics-privatedata-users for Joe Sutherland (foks) - https://phabricator.wikimedia.org/T136137#2328198 (10RobH) [21:18:50] (03PS1) 10Dzahn: assign 'c' IPs for restbase100[7-9] [puppet] - 10https://gerrit.wikimedia.org/r/290797 (https://phabricator.wikimedia.org/T136206) [21:20:06] valhallasw`cloud: so this php container only has php things, and is debian jessie too. 
I'm going to start with these kinds of containers and then build the backwards compat ones later [21:20:14] nagf is currently running in this and seems ok [21:20:21] (03PS2) 10Dzahn: assign 'c' IPs for restbase100[7-9] [puppet] - 10https://gerrit.wikimedia.org/r/290797 (https://phabricator.wikimedia.org/T136206) [21:22:40] waiting for CI and then I'll deploy https://gerrit.wikimedia.org/r/#/c/290789/ and https://gerrit.wikimedia.org/r/#/c/290799/ [21:22:44] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 716 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6516083 keys - replication_delay is 716 [21:23:35] (03PS1) 10Eevans: stub out missing 'c' instances [puppet] - 10https://gerrit.wikimedia.org/r/290800 (https://phabricator.wikimedia.org/T136206) [21:24:25] (03CR) 10Dzahn: "see https://gerrit.wikimedia.org/r/#/c/290797/" [puppet] - 10https://gerrit.wikimedia.org/r/290800 (https://phabricator.wikimedia.org/T136206) (owner: 10Eevans) [21:25:37] (03CR) 1020after4: "apparently dumpsdeploy isn't right after all?" [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [21:26:39] (03CR) 10Eevans: "These new sections all need to be commented out until we're ready to bootstrap the respective instances, but +1 otherwise." [puppet] - 10https://gerrit.wikimedia.org/r/290797 (https://phabricator.wikimedia.org/T136206) (owner: 10Dzahn) [21:27:31] (03CR) 10Dzahn: "in labs/private it's "deploy_service" , and renamed in https://gerrit.wikimedia.org/r/#/c/290718/ ..
let's just use the same name :p" [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [21:28:35] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6477027 keys - replication_delay is 0 [21:30:27] (03PS1) 10Dzahn: add fake dumpsdeploy key pair [labs/private] - 10https://gerrit.wikimedia.org/r/290804 [21:31:00] (03CR) 10Dzahn: [C: 032] add fake dumpsdeploy key pair [labs/private] - 10https://gerrit.wikimedia.org/r/290804 (owner: 10Dzahn) [21:31:09] (03CR) 10Dzahn: [V: 032] add fake dumpsdeploy key pair [labs/private] - 10https://gerrit.wikimedia.org/r/290804 (owner: 10Dzahn) [21:31:19] twentyafterfour: try "rebuild last" in compiler now [21:32:42] (03CR) 10Dzahn: "try rebuilding it one more time now (https://gerrit.wikimedia.org/r/290804)" [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [21:33:26] 06Operations, 06Labs, 10Labs-Infrastructure: rcstream not working for wikitech wiki - https://phabricator.wikimedia.org/T136245#2328279 (10Krenair) I don't have time to dig into this today but when I looked at `telnet rcs1001.eqiad.wmnet 6379` from silver earlier it would try to IPv6 for a minute and fail, t... [21:34:46] 06Operations, 10Traffic, 13Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2328291 (10BBlack) `lg_dirty_mult:6` doesn't seem to be making a very big impact on its own so far, so I'm changing the cp3048 experiment to `lg_dirty_mult:6,lg_chunk:20` (... 
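Krenair's note above, that `telnet rcs1001.eqiad.wmnet 6379` from silver tries IPv6 for a while before failing, suggests checking each resolved address family separately: a host can be perfectly reachable via its A record while the firewall drops the AAAA path. A small self-contained Python sketch of that check (the host and port in the usage line are the ones from this log, nothing here is rcstream-specific):

```python
import socket

def reachable_addrs(host, port, timeout=3.0):
    """Try a TCP connect to every address the name resolves to,
    one address at a time, and report which ones work. Useful when
    a host has both AAAA and A records but only one family passes
    the firewall."""
    results = []
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        s = socket.socket(family, socktype, proto)
        s.settimeout(timeout)
        try:
            s.connect(sockaddr)
            ok = True
        except OSError:
            ok = False
        finally:
            s.close()
        results.append((sockaddr[0], ok))
    return results
```

For the case in the log, `reachable_addrs("rcs1001.eqiad.wmnet", 6379)` would return one `(address, ok)` pair per resolved address, making an IPv6-only firewall gap immediately visible.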
[21:36:37] (03PS2) 10Yuvipanda: Switch to using wikimedia-jessie as base container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/290795 [21:36:39] (03PS2) 10Yuvipanda: Add a simple builder script [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/290793 [21:39:49] (03CR) 10Dzahn: "compiles now: http://puppet-compiler.wmflabs.org/2920/tin.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [21:39:54] twentyafterfour: http://puppet-compiler.wmflabs.org/2920/tin.eqiad.wmnet/ [21:46:00] twentyafterfour: legoktm: Will anyone take care of syncing the core change? [21:46:49] * aude waiting to get my sms notification :) [21:49:40] 06Operations, 06Services: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2328364 (10RobH) [21:51:00] hoo: yes [21:51:05] I'm doing it now [21:51:10] thanks [21:52:55] !log syncing https://gerrit.wikimedia.org/r/#/c/290789/ and https://gerrit.wikimedia.org/r/#/c/290799/ [21:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:56:52] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to restricted and analytics-privatedata-users for Joe Sutherland (foks) - https://phabricator.wikimedia.org/T136137#2328395 (10RobH) p:05Triage>03Normal [21:58:15] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T134755#2328397 (10RobH) [22:00:02] damnit scap [22:00:15] (03CR) 10EBernhardson: [C: 031] "this looks sane to me. Not sure about the effect on non-elasticsearch servers" [puppet] - 10https://gerrit.wikimedia.org/r/290487 (owner: 10Gehel) [22:00:53] !log twentyafterfour@tin Synchronized php-1.28.0-wmf.3: (no message) (duration: 08m 18s) [22:01:04] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[22:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:02:20] 06Operations, 10ops-eqiad: rack/setup/deploy 3 eqiad druid nodes - https://phabricator.wikimedia.org/T134275#2328406 (10Ottomata) We are actually getting close to ready for this. Puppet development work got a little complicated. what does 'update install_server module' mean? [22:02:55] (03PS1) 1020after4: wikidata back to 1.28.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290814 [22:03:12] (03CR) 1020after4: [C: 032] wikidata back to 1.28.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290814 (owner: 1020after4) [22:03:14] PROBLEM - Apache HTTP on mw1249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:03:19] twentyafterfour: Confirmed fixed on test.wikidata, btw [22:03:21] (03CR) 1020after4: [V: 032] wikidata back to 1.28.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290814 (owner: 1020after4) [22:03:48] (03Merged) 10jenkins-bot: wikidata back to 1.28.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290814 (owner: 1020after4) [22:04:14] PROBLEM - HHVM rendering on mw1249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:04:26] hoo: Thanks! [22:04:43] !log re-re-verting wikidata back to wmf.3 [22:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:05:24] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: wikidata back to 1.28.0-wmf.3 [22:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:05:39] Confirmed still working [22:06:08] * aude got sms :) [22:06:12] o/ [22:06:18] thanks twentyafterfour [22:06:41] aude, hoo, no problem! 
:) [22:08:49] (03PS29) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [22:09:14] (03CR) 1020after4: [C: 031] service::node: Prepare for scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/290490 (owner: 10Mobrovac) [22:11:48] (03CR) 1020after4: [C: 031] "this looks good, however, once https://gerrit.wikimedia.org/r/#/c/289236/ lands you should add the ores scap::source to hieradata/common/s" [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [22:17:59] (03CR) 10BryanDavis: "The script seems pretty straight forward and simple. Initially I wondered if it should really be using a "proper build system" of some sor" (032 comments) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/290793 (owner: 10Yuvipanda) [22:22:54] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [22:25:10] (03PS13) 10Ottomata: Druid module and analytics_cluster role class [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) [22:45:51] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup Fundraising DB - https://phabricator.wikimedia.org/T136200#2328574 (10Jgreen) Yeah I thought that might happen, which is why I was hoping for a resolution to the SRX linecard issue ASAP. The only host that can be removed at this time is alum... [22:57:31] (03CR) 10GWicke: [C: 031] Change Prop: Purge RESTBase re-renders [puppet] - 10https://gerrit.wikimedia.org/r/290748 (owner: 10Mobrovac) [23:00:04] RoanKattouw ostriches Krenair MaxSem Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160525T2300). [23:00:04] foks: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. 
Please be available during the process. [23:03:31] l [23:03:35] Uh, oops [23:03:35] Hello. [23:03:46] (03CR) 10Dzahn: [C: 032] "only style changes and in ./tests/" [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/289979 (owner: 10Dzahn) [23:04:21] Let me check the SAL to see if tyler did a full SWAT this morning. [23:04:38] I did SWAT. I didn't run a full scap today [23:04:44] Dereckson: ^ [23:04:59] Okay. [23:07:36] (03PS1) 10Dzahn: zookeeper: bump submodule [puppet] - 10https://gerrit.wikimedia.org/r/290823 [23:08:17] thcipriani: so yesterday, I manually started a l10nupdate process, it failed at the final step, same for the automated process at 2am UTC. But the changes are merged in the wmf2 and wmf3 branches. What do you think about finishing the process by pulling the branches into /srv/mediawiki-staging and doing a full scap? [23:08:42] Another possibility: I can do that only for WikimediaMessages, the extension foks is interested in. [23:10:20] The first solution would have the advantage of ensuring wmf branches = code actually deployed to the server, the second to see if the l10nupdate process heals itself now that mwxxxx issues are solved at 2am. [23:10:29] (03CR) 10Ladsgroup: "One thing I don't understand is that "Scap::target" is not being declared explicitly. I tried to change scap::target key_name parameter to" [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [23:10:31] I would probably pull everything into /srv/mediawiki-staging/php-1.28.0-wmf.{2,3} and run scap [23:10:43] * Dereckson nods [23:11:21] are the l10nupdate failures tracked in phab?
[23:12:16] I filed a bug for a deprecated scap subcommand, we need one for "l10nupdate is hanging if one mw server isn't responsive at sync time" [23:13:03] That's a general scap problem but typically can be handled by the operator [23:14:31] guys, i merged this earlier [23:14:33] https://gerrit.wikimedia.org/r/#/c/289886/ [23:14:44] Switch from calling refreshCdbJsonFiles to the 'scap cdb-json-refresh' [23:14:58] isnt that the deprecated sub command you mean [23:15:04] yeah [23:15:11] so, should be fixed? [23:15:31] hopefully, yes [23:16:00] that was mostly just an annoyance though. the indefinite hangs for unresponsive hosts are a bigger deal [23:16:10] aha! ok [23:16:24] and thanks for the merge on that :) [23:17:51] (03Abandoned) 10Dereckson: l10nupdate: use scap subcommands [puppet] - 10https://gerrit.wikimedia.org/r/290620 (https://phabricator.wikimedia.org/T136157) (owner: 10Dereckson) [23:18:49] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 03Scap3 (Scap3-MediaWiki-MVP): Completely port l10nupdate to scap - https://phabricator.wikimedia.org/T133913#2328633 (10Dereckson) [23:18:55] (03CR) 10Dzahn: [C: 032] zookeeper: bump submodule [puppet] - 10https://gerrit.wikimedia.org/r/290823 (owner: 10Dzahn) [23:19:14] yw [23:20:29] Okay, I was wrong in my assertion, l10nupdate only merged in branches @ /var/lib/l10nupdate/ [23:20:47] wmf branches are clean [23:22:48] So let's run l10nupdate as this evening all is fine for mwxxxx.
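The "hangs for unresponsive hosts" problem discussed above has a standard fix shape: put a deadline on each per-host sync and collect the failures instead of letting one dead mw server block the whole batch. A sketch of the pattern in Python; this is only an illustration, not scap's actual implementation, and `make_cmd`, the host names, and the timeout value are hypothetical:

```python
import subprocess

def run_with_timeout(cmd, timeout):
    """Run one per-host command, giving up after `timeout` seconds
    instead of blocking forever."""
    try:
        res = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return res.returncode == 0, res.stdout
    except subprocess.TimeoutExpired:
        return False, ""

def sync_hosts(hosts, make_cmd, timeout=30):
    """Sync every host, returning the list of hosts that failed or
    timed out so the operator can retry or depool them."""
    failed = []
    for host in hosts:
        ok, _ = run_with_timeout(make_cmd(host), timeout)
        if not ok:
            failed.append(host)
    return failed
```

With this shape a single unresponsive host costs at most `timeout` seconds and shows up in the failure list, rather than stalling the run indefinitely.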
[23:26:16] (03CR) 10Dzahn: [C: 032] kafka: fix lint warnings [puppet/kafka] - 10https://gerrit.wikimedia.org/r/289980 (owner: 10Dzahn) [23:26:55] https://phabricator.wikimedia.org/T136255 for the l10nupdate/scap hanging task [23:27:41] (03PS1) 10Dzahn: kafka: bump submodule for lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/290827 [23:31:24] (03CR) 10Dzahn: [C: 032] kafka: bump submodule for lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/290827 (owner: 10Dzahn) [23:32:30] (03PS2) 10Dzahn: install_server: move mirrors stuff to own role [puppet] - 10https://gerrit.wikimedia.org/r/284809 (https://phabricator.wikimedia.org/T132757) [23:34:49] (03PS3) 10Dzahn: install_server: move mirrors stuff to own role [puppet] - 10https://gerrit.wikimedia.org/r/284809 (https://phabricator.wikimedia.org/T132757) [23:35:38] mutante: could you merge https://gerrit.wikimedia.org/r/#/c/290792/ for me? it's like the others, it'll start a cassandra bootstrap [23:35:51] RECOVERY - cassandra-b CQL 10.192.48.50:9042 on restbase2006 is OK: TCP OK - 0.034 second response time on port 9042 [23:36:43] (03PS2) 10Dzahn: enable instance restbase2005-b [puppet] - 10https://gerrit.wikimedia.org/r/290792 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [23:36:51] \o/ [23:36:56] urandom: no problem, actually we got an announcement about this [23:37:01] "You might get pinged by the Services team to merge a patch similar to decommenting a single hieradata/hosts/restbase*.yaml file"" :) [23:37:11] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [23:37:17] mutante: great; it helps! 
[23:37:35] (03CR) 10Dzahn: [C: 032] enable instance restbase2005-b [puppet] - 10https://gerrit.wikimedia.org/r/290792 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [23:39:04] 23:35:53 Updated 392 JSON file(s) in /srv/mediawiki-staging/php-1.28.0-wmf.2/cache/l10n [23:39:06] Syncing to Apaches at 2016-05-25 23:35:53+00:00 [23:40:38] (03PS4) 10Dzahn: install_server: move mirrors stuff to own role [puppet] - 10https://gerrit.wikimedia.org/r/284809 (https://phabricator.wikimedia.org/T132757) [23:44:37] !log Bootstrapping restbase2005-b.codfw.wmnet : T134016 [23:44:38] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [23:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:46:06] !log dereckson@tin scap sync-l10n completed (1.28.0-wmf.2) (duration: 10m 12s) [23:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:46:26] Rebuilding localization cache at 2016-05-25 23:46:16+00:00 [23:46:39] for 1.28.0-wmf.3 now [23:56:01] (03CR) 10Dzahn: [C: 032] install_server: move mirrors stuff to own role [puppet] - 10https://gerrit.wikimedia.org/r/284809 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn)