[00:10:59] operations, HHVM: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#1940286 (matmarex) After every ICU upgrade, we need to run a long-running (takes a few days on largest wikis, IIRC) maintenance script on a couple dozen wikis (see task description), which is why w...
[00:18:17] RECOVERY - puppet last run on mw2039 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[00:20:37] operations: Adding/Removing users from enWP Arbcom Mailinglist archives - https://phabricator.wikimedia.org/T123787#1940296 (Jalexander) Also adding: Callanecc Key: https://en.wikipedia.org/wiki/User:Callanecc/Key
[01:30:32] (PS1) Andrew Bogott: Add aliases for labtest websites: [dns] - https://gerrit.wikimedia.org/r/264703
[01:34:33] (CR) Andrew Bogott: [C: 2] Add aliases for labtest websites: [dns] - https://gerrit.wikimedia.org/r/264703 (owner: Andrew Bogott)
[01:42:49] (PS1) Andrew Bogott: Specify webserver_hostname for the openstack_manager class. [puppet] - https://gerrit.wikimedia.org/r/264704
[01:43:47] (CR) jenkins-bot: [V: -1] Specify webserver_hostname for the openstack_manager class. [puppet] - https://gerrit.wikimedia.org/r/264704 (owner: Andrew Bogott)
[01:45:22] (PS2) Andrew Bogott: Specify webserver_hostname for the openstack_manager class. [puppet] - https://gerrit.wikimedia.org/r/264704
[01:47:48] (CR) Andrew Bogott: [C: 2] Specify webserver_hostname for the openstack_manager class. [puppet] - https://gerrit.wikimedia.org/r/264704 (owner: Andrew Bogott)
[01:50:00] (PS1) Andrew Bogott: Typo fix [puppet] - https://gerrit.wikimedia.org/r/264705
[01:51:10] (CR) Andrew Bogott: [C: 2] Typo fix [puppet] - https://gerrit.wikimedia.org/r/264705 (owner: Andrew Bogott)
[01:54:38] (PS1) Andrew Bogott: Add labtestwikitech apache config [puppet] - https://gerrit.wikimedia.org/r/264706
[01:55:52] (CR) Andrew Bogott: [C: 2] Add labtestwikitech apache config [puppet] - https://gerrit.wikimedia.org/r/264706 (owner: Andrew Bogott)
[02:19:45] operations, Mail: remove exim alias feedbacktest@ - https://phabricator.wikimedia.org/T123665#1940338 (Jdlrobson) I'd assume so. I believe this is the old feedback form we used to run on web (which was mostly noise)
[02:30:21] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 11m 38s)
[02:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:49:09] !log updated annualreport for foks
[02:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:50:33] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 08m 39s)
[02:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:57:41] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Jan 18 02:57:41 UTC 2016 (duration 7m 8s)
[02:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:59:48] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: puppet fail
[03:07:16] operations: Adding/Removing users from enWP Arbcom Mailinglist archives - https://phabricator.wikimedia.org/T123787#1940361 (yuvipanda) I've done all these things and @Jalexander is verifying and GPG-mailing the passwords to these people. If only we didn't use software from the early 90s...
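[note] On the T86096 comment at 00:10:59 above: the long-running maintenance script matmarex refers to is presumably MediaWiki's updateCollation.php, which rebuilds category sort keys after a collation (ICU) change. A minimal sketch of a per-wiki run, assuming the usual mwscript wrapper and a hypothetical dblist of the affected wikis:
    # Sketch only: rebuild category collation sort keys after an ICU upgrade.
    # affected-wikis.dblist is a hypothetical list; --force re-sorts every row,
    # which is what makes the run take days on the largest wikis.
    for wiki in $(<affected-wikis.dblist); do
        mwscript updateCollation.php --wiki="$wiki" --force
    done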
[03:27:18] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[03:44:17] PROBLEM - puppet last run on cp1069 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:09:29] RECOVERY - puppet last run on cp1069 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[04:36:17] PROBLEM - HHVM rendering on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:38:08] RECOVERY - HHVM rendering on mw1057 is OK: HTTP OK: HTTP/1.1 200 OK - 69433 bytes in 0.081 second response time
[04:53:12] Puppet, MediaWiki-Vagrant, Easy, Patch-For-Review: MediaWiki-Vagrant guest OS clock gets out of sync - https://phabricator.wikimedia.org/T116507#1940437 (Tgr) That link seems to be about running the NTP daemon on the guest, but if Virtualbox can do that, that's certainly more convenient.
[06:31:37] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:37] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:58] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:07] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:08] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:08] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:28] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:28] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:28] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:48] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:38] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:42:52] (PS1) KartikMistry: cxserver: Enable all source languages for Yandex [puppet] - https://gerrit.wikimedia.org/r/264719 (https://phabricator.wikimedia.org/T123906)
[06:56:38] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:56:59] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:57:08] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:57:18] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:57:18] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:57:38] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:57:38] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:57:38] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:57:57] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:47] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:48] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:14:56] (PS2) KartikMistry: cxserver: Enable all source languages for Yandex [puppet] - https://gerrit.wikimedia.org/r/264719 (https://phabricator.wikimedia.org/T123906)
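[note] On the T116507 clock-drift comment at 04:53:12 above: VirtualBox can indeed re-sync the guest clock itself, without an NTP daemon in the guest. A sketch, assuming the documented VBoxService time-sync guest property (the VM name is a placeholder):
    # Sketch: have VBoxService hard-set the guest clock whenever drift
    # exceeds 1000 ms, instead of only slewing it gradually.
    VBoxManage guestproperty set "mediawiki-vagrant" \
        "/VirtualBox/GuestAdd/VBoxService/--timesync-set-threshold" 1000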
[08:00:04] Deploy window US Holiday (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160118T0800)
[08:53:48] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0]
[08:57:48] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /a 326801 MB (3% inode=99%)
[08:57:58] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[09:02:09] RECOVERY - Disk space on stat1002 is OK: DISK OK
[09:05:17] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 640
[09:15:17] RECOVERY - check_mysql on db1008 is OK: Uptime: 500150 Threads: 2 Questions: 3742864 Slow queries: 3079 Opens: 1350 Flush tables: 2 Open tables: 396 Queries per second avg: 7.483 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:57:59] ACKNOWLEDGEMENT - puppet last run on ms-be2015 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi sdl failed T123830
[10:13:18] RECOVERY - RAID on ms-be2007 is OK: OK: optimal, 14 logical, 14 physical
[10:13:20] RECOVERY - swift-object-auditor on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[10:13:20] RECOVERY - swift-container-replicator on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[10:13:20] RECOVERY - dhclient process on ms-be2007 is OK: PROCS OK: 0 processes with command name dhclient
[10:13:57] RECOVERY - swift-container-updater on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[10:13:58] RECOVERY - Check size of conntrack table on ms-be2007 is OK: OK: nf_conntrack is 0 % full
[10:13:58] RECOVERY - swift-container-server on ms-be2007 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[10:13:58] RECOVERY - DPKG on ms-be2007 is OK: All packages OK
[10:14:47] RECOVERY - configured eth on ms-be2007 is OK: OK - interfaces up
[10:14:48] RECOVERY - swift-account-server on ms-be2007 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[10:14:48] RECOVERY - swift-object-updater on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[10:14:48] RECOVERY - Disk space on ms-be2007 is OK: DISK OK
[10:15:17] RECOVERY - very high load average likely xfs on ms-be2007 is OK: OK - load average: 7.76, 2.95, 1.30
[10:15:17] RECOVERY - swift-container-auditor on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[10:15:57] RECOVERY - swift-account-auditor on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[10:16:08] RECOVERY - swift-account-reaper on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[10:16:08] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[10:16:37] RECOVERY - swift-object-replicator on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[10:16:42] !log dist-upgrade ms-be3002 to trusty
[10:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:16:57] RECOVERY - swift-account-replicator on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[10:16:58] RECOVERY - swift-object-server on ms-be2007 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[10:18:58] PROBLEM - RAID on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:19:38] PROBLEM - Swift HTTP backend on ms-fe2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:20:18] PROBLEM - very high load average likely xfs on ms-be2016 is CRITICAL: CRITICAL - load average: 416.85, 300.17, 141.92
[10:21:38] RECOVERY - Swift HTTP backend on ms-fe2001 is OK: HTTP OK: HTTP/1.1 200 OK - 396 bytes in 0.084 second response time
[10:28:08] PROBLEM - Swift HTTP backend on ms-fe2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:30:27] RECOVERY - Swift HTTP backend on ms-fe2004 is OK: HTTP OK: HTTP/1.1 200 OK - 396 bytes in 0.095 second response time
[10:30:48] RECOVERY - salt-minion processes on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:32:38] PROBLEM - SSH on ms-be2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:33:48] PROBLEM - puppet last run on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:41:57] PROBLEM - salt-minion processes on tin is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:42:51] !log powercycle ms-be2016, high load avg
[10:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:45:48] PROBLEM - Host ms-be2016 is DOWN: PING CRITICAL - Packet loss = 100%
[10:47:38] RECOVERY - Host ms-be2016 is UP: PING OK - Packet loss = 0%, RTA = 36.39 ms
[10:47:39] RECOVERY - SSH on ms-be2016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0)
[10:48:49] RECOVERY - very high load average likely xfs on ms-be2016 is OK: OK - load average: 7.41, 3.62, 1.43
[10:48:57] RECOVERY - puppet last run on ms-be2016 is OK: OK: Puppet is currently enabled, last run 57 minutes ago with 0 failures
[10:49:29] RECOVERY - RAID on ms-be2016 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[11:08:09] RECOVERY - salt-minion processes on tin is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:36:37] (PS3) KartikMistry: cxserver: Add all available source languages for Russian in Yandex MT [puppet] - https://gerrit.wikimedia.org/r/264719 (https://phabricator.wikimedia.org/T123906)
[11:36:56] (PS4) KartikMistry: cxserver: Add all available source languages for Russian in Yandex MT [puppet] - https://gerrit.wikimedia.org/r/264719 (https://phabricator.wikimedia.org/T123906)
[11:42:02] (Draft1) Addshore: wgRCWatchCategoryMembership true on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/264732
[11:51:26] (Draft1) Addshore: wgRCWatchCategoryMembership true on wikipedias & commons [mediawiki-config] - https://gerrit.wikimedia.org/r/264733
[11:58:16] operations, Swift: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918#1940833 (fgiunchedi) NEW
[12:13:08] RECOVERY - swift-account-auditor on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[12:13:28] RECOVERY - swift-container-auditor on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[12:13:29] RECOVERY - swift-container-updater on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[12:13:29] RECOVERY - swift-container-replicator on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[12:13:29] RECOVERY - swift-container-server on ms-be3001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[12:14:27] RECOVERY - swift-object-auditor on ms-be3001 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[12:14:27] RECOVERY - swift-account-reaper on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[12:14:27] RECOVERY - swift-account-replicator on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[12:56:25] (CR) Thiemo Mättig (WMDE): [C: 1] Correct HTML code for WMF image [mediawiki-config] - https://gerrit.wikimedia.org/r/264461 (owner: Suriyaa Kudo)
[13:09:28] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /a 329644 MB (3% inode=99%)
[13:57:27] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:57:27] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:58:47] PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:58:59] PROBLEM - puppet last run on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:04:28] PROBLEM - configured eth on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:04:28] PROBLEM - cassandra-a service on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:04:48] PROBLEM - DPKG on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:06:08] PROBLEM - RAID on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:06:37] PROBLEM - salt-minion processes on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:08:47] RECOVERY - salt-minion processes on praseodymium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:09:17] PROBLEM - Restbase root url on praseodymium is CRITICAL: Connection refused
[14:14:43] looks like praseodymium is down, looking
[14:15:18] PROBLEM - salt-minion processes on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:19:28] PROBLEM - Check size of conntrack table on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:19:36] mhh acpi_pad kernel thread using a lot of cpu, never seen this before
[14:22:57] PROBLEM - dhclient process on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:23:57] RECOVERY - Check size of conntrack table on praseodymium is OK: OK: nf_conntrack is 0 % full
[14:24:17] PROBLEM - Disk space on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
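[note] Context for the 14:19:36 observation: acpi_pad is the kernel's "processor aggregator" idle-injection driver, and when it misbehaves it shows up as kernel threads pinning CPUs. A minimal sketch of how to confirm that with standard procps tooling (nothing here is host-specific):
    # Sketch: list the busiest threads, kernel threads included;
    # a runaway acpi_pad appears as acpi_pad/N entries near the top.
    ps -eLo pid,tid,pcpu,comm --sort=-pcpu | head -20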
[14:24:44] !log powercycle praseodymium
[14:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:25:58] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy
[14:28:45] operations: acpi_pad runaway processes on praseodymium - https://phabricator.wikimedia.org/T123924#1941039 (fgiunchedi)
[14:29:37] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy
[14:32:17] PROBLEM - Host praseodymium is DOWN: PING CRITICAL - Packet loss = 100%
[14:32:58] RECOVERY - salt-minion processes on praseodymium is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:33:07] RECOVERY - Host praseodymium is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms
[14:33:17] RECOVERY - Disk space on praseodymium is OK: DISK OK
[14:33:37] RECOVERY - DPKG on praseodymium is OK: All packages OK
[14:34:08] RECOVERY - dhclient process on praseodymium is OK: PROCS OK: 0 processes with command name dhclient
[14:34:47] RECOVERY - RAID on praseodymium is OK: OK: no disks configured for RAID
[14:41:38] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:42:52] (PS2) Ottomata: Remove MobileWebSectionUsage from blacklist [puppet] - https://gerrit.wikimedia.org/r/264465 (owner: Milimetric)
[14:43:39] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy
[14:50:27] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:51:08] RECOVERY - Restbase root url on praseodymium is OK: HTTP OK: HTTP/1.1 200 - 15214 bytes in 0.042 second response time
[14:52:29] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy
[14:52:37] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy
[14:52:57] RECOVERY - cassandra-a service on praseodymium is OK: OK - cassandra-a is active
[14:57:23] operations: acpi_pad runaway processes on praseodymium - https://phabricator.wikimedia.org/T123924#1941098 (fgiunchedi) I've "fixed" it by turning off and on "logical processor" in bios (i.e. hyperthreading) without needing to physically drain power. I'm not sure about the exact functions of acpi_pad but we c...
[15:20:49] akosiaris: say, https://etherpad.wikimedia.org/p/ReadingWebQ3Planning doesn't seem to be loading, whereas other etherpads are loading and can be started. are you able to get at https://etherpad.wikimedia.org/p/ReadingWebQ3Planning or able to extract its contents and email them to me?
[15:27:05] anybody able to take a look at the etherpad issue? there's apparently a websocket 400 going on. cc phuedx
[15:27:09] cc bd808
[15:38:40] (PS1) Ottomata: Use hiera to configure hive and oozie server hostnames [puppet] - https://gerrit.wikimedia.org/r/264742 (https://phabricator.wikimedia.org/T110090)
[15:45:31] dr0ptp4kt: I am taking a look, but as far as the extracting stuff out of an etherpad manually... I doubt it will make any sense to you
[15:45:40] it looks like the 400 WS error also happens on pads working properly
[15:45:43] it's storing json changesets
[15:46:02] akosiaris: k. json changesets, woah.
[15:46:04] Invalid changeset (checkRep failed)
[15:46:11] so, corrupt pad
[15:46:51] ema: the websocket thing is meant to fail, we don't support websockets yet
[15:47:14] akosiaris: got it. is it possible to restore it to the latest non-corrupt state?
[15:48:13] dr0ptp4kt: maybe... depends on how the bad changeset is stored
[15:48:23] not promising anything though... it's a long shot
[15:48:31] akosiaris: understood
[15:52:11] operations, Continuous-Integration-Infrastructure, Gerrit, GitHub-Mirrors, and 3 others: [Task] Redirect unused extensions/ValueView repository to data-values/value-view - https://phabricator.wikimedia.org/T123624#1941263 (thiemowmde) Personally I very much prefer redirects because they do not lea...
[15:52:45] (PS1) WMDE-leszek: Phragile: Ensure clone before creating storage dir [puppet] - https://gerrit.wikimedia.org/r/264745
[16:03:22] bd808: phuedx ^ akosiaris is looking into restoration to a non-corrupted pad state
[16:03:50] cool
[16:03:50] ta
[16:04:14] dr0ptp4kt: maybe we can submit a FOIA request to the NSA for the contents of the pad ;)
[16:04:45] bd808: KEYWORD_DETECTION_ACTIVATED
[16:04:50] 10018 revisions ? sigh...
[16:05:07] or so I think at least... maybe more
[16:06:37] akosiaris: i wonder if somebody vandalized it. the last major edits were friday, jan 15 0100 utc approximately
[16:06:54] (legitimate edits)
[16:07:48] (CR) Ottomata: [C: 2] Use hiera to configure hive and oozie server hostnames [puppet] - https://gerrit.wikimedia.org/r/264742 (https://phabricator.wikimedia.org/T110090) (owner: Ottomata)
[16:07:53] dr0ptp4kt: well, vandalize is not a good word for this. Somehow, someone, most probably not maliciously, triggered a bug and corrupted a pad. That's not the first time and given etherpad's track record it will not be the last
[16:08:09] akosiaris: /me cries
[16:08:25] look at this for example
[16:08:27] pad:ReadingWebQ3Planning:revs:10018 | {"changeset":"Z:hbb<9|9e=f9m-a*2+1$O","meta":{"author":"a.CM2Hc5CAvS1dZuDw","timestamp":1452819512958}}
[16:09:06] that's revision 10018 ... just a change from the previous one I assume.... which one actually has useful data in it ... still looking
[16:09:15] dr0ptp4kt: https://etherpad.wikimedia.org/p/ReadingWebQ3Planning/timeslider#10018
[16:09:30] wow, that works ?
[16:09:32] bd808: nice
[16:09:41] I did not expect the timeslider to work
[16:10:23] bd808: nice!
[16:10:55] akosiaris: thx also for looking into it. no point mucking with it any further. i've saved off a copy
[16:12:55] dr0ptp4kt: ok, I'll muck around a bit more with it in case I find something... this is interesting... the freaking timeslider working but not the pad itself is weird
[16:13:05] oh and that changeset looks like it's the last one
[16:13:17] can't find anything after that
[16:15:15] heh, https://github.com/ether/etherpad-lite/issues/2107
[16:15:26] still open, 1.5 years later
[16:15:56] akosiaris: that's the bug that made me try timeslider
[16:17:05] there is a dump&reimport script that some folk had success with -- https://github.com/ether/etherpad-lite/pull/2210/files -- but it requires taking the whole server offline
[16:18:14] that's probably to guarantee the transaction
[16:18:54] I've tried it .. no change
[16:23:13] akosiaris: :)
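[note] For reference on the pad spelunking above: etherpad-lite stores pads through ueberDB, which on the default MySQL backend keeps everything in a single key/value table named store; that is where the pad:ReadingWebQ3Planning:revs:10018 row quoted at 16:08:27 comes from. A sketch of the kind of query used to walk a pad's changesets, assuming that schema (the database name is a placeholder):
    # Sketch: dump all stored revisions of the pad, in revision order.
    mysql etherpad -e "SELECT \`key\`, \`value\` FROM store
        WHERE \`key\` LIKE 'pad:ReadingWebQ3Planning:revs:%'
        ORDER BY CAST(SUBSTRING_INDEX(\`key\`, ':', -1) AS UNSIGNED);"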
[16:33:58] (CR) Subramanya Sastry: "integration/visualdiff has now been instantiated ... this is ready for review and merge. I can then apply on ruthenium and test it." [puppet] - https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: Subramanya Sastry)
[16:45:58] (CR) Anomie: "I put this on SWAT for tomorrow morning, FYI." [mediawiki-config] - https://gerrit.wikimedia.org/r/264437 (owner: Anomie)
[16:50:42] (CR) Bartosz Dziewoński: "(typo: "dupliacte")" [mediawiki-config] - https://gerrit.wikimedia.org/r/264437 (owner: Anomie)
[16:54:35] (CR) Nuria: "Thanks for doing this. Did you re-started EL so changes take effect?" [puppet] - https://gerrit.wikimedia.org/r/264465 (owner: Milimetric)
[17:00:13] (PS1) Luke081515: Apply global blocks for meta, not for deploymentwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/264758 (https://phabricator.wikimedia.org/T123936)
[17:04:30] (PS3) Anomie: Centralize and add rights and grants in preparation for grants moving into core [mediawiki-config] - https://gerrit.wikimedia.org/r/264437
[17:04:43] (PS2) Luke081515: Apply global blocks at meta, not at deploymentwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/264758 (https://phabricator.wikimedia.org/T123936)
[17:06:40] (PS1) Ottomata: Move hive-server, hive-metastore and oozie from analytics1027 to analytics1015 [puppet] - https://gerrit.wikimedia.org/r/264760 (https://phabricator.wikimedia.org/T110090)
[17:06:47] (CR) Luke081515: [C: 1] Centralize and add rights and grants in preparation for grants moving into core [mediawiki-config] - https://gerrit.wikimedia.org/r/264437 (owner: Anomie)
[17:07:33] (CR) jenkins-bot: [V: -1] Move hive-server, hive-metastore and oozie from analytics1027 to analytics1015 [puppet] - https://gerrit.wikimedia.org/r/264760 (https://phabricator.wikimedia.org/T110090) (owner: Ottomata)
[17:27:07] (CR) Alex Monk: [C: 2] "looks like global blocks in beta have come from deployment instead of meta for quite a while" [mediawiki-config] - https://gerrit.wikimedia.org/r/264758 (https://phabricator.wikimedia.org/T123936) (owner: Luke081515)
[17:27:31] (Merged) jenkins-bot: Apply global blocks at meta, not at deploymentwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/264758 (https://phabricator.wikimedia.org/T123936) (owner: Luke081515)
[17:30:31] (PS2) Ottomata: Move hive-server, hive-metastore and oozie from analytics1027 to analytics1015 [puppet] - https://gerrit.wikimedia.org/r/264760 (https://phabricator.wikimedia.org/T110090)
[17:30:51] !log krenair@tin Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/264758 - labs-only change (duration: 00m 36s)
[17:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:30:59] Luke081515, ^
[17:31:29] (CR) jenkins-bot: [V: -1] Move hive-server, hive-metastore and oozie from analytics1027 to analytics1015 [puppet] - https://gerrit.wikimedia.org/r/264760 (https://phabricator.wikimedia.org/T110090) (owner: Ottomata)
[17:33:12] (PS1) Jforrester: Enable VisualEditor by default for some other wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/264765 (https://phabricator.wikimedia.org/T116523)
[17:34:02] Krenair: Thanks
[17:34:32] (CR) Jforrester: "Scheduled for 25 January." [mediawiki-config] - https://gerrit.wikimedia.org/r/264765 (https://phabricator.wikimedia.org/T116523) (owner: Jforrester)
[17:51:59] Blocked-on-Operations, operations, ops-eqiad, Patch-For-Review: reclaim erbium, gadolinium into spares - https://phabricator.wikimedia.org/T123029#1941591 (akosiaris) Are those going to be reinstalled soon ? I see erbium still in DNS btw
[17:54:36] operations, DBA, Patch-For-Review: Better mysql monitoring for number of connections and processlist strange patterns - https://phabricator.wikimedia.org/T112473#1941615 (jcrespo) Open>stalled
[19:01:55] mobrovac: did anyone follow up with you about updating your contact info?
[19:12:38] PROBLEM - piwik.wikimedia.org on bohrium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:14:29] RECOVERY - piwik.wikimedia.org on bohrium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 518 bytes in 0.010 second response time
[19:25:29] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: puppet fail
[19:40:08] PROBLEM - very high load average likely xfs on ms-be1001 is CRITICAL: CRITICAL - load average: 234.71, 171.11, 86.19
[19:52:59] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:06:36] (PS1) Andrew Bogott: Add labtestwiki, a testing version of labswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/264786
[20:08:16] (PS2) Andrew Bogott: Add labtestwiki, a testing version of labswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/264786
[20:24:55] hm, what happened to rb on praseodymium today?
[20:25:19] Oh God we still have prasdfjsdlkfjdslkfium?
[20:25:42] I was talking about that box just the other day as an example of terrible server names from the past
[20:25:46] But now I see it's not from the past
[20:31:15] RoanKattouw: i don't have a problem with it, as long as my shell can auto-complete it :P
[20:47:42] (CR) Alex Monk: [C: -1] "Shouldn't this be labstestwiki, with an s?" [mediawiki-config] - https://gerrit.wikimedia.org/r/264786 (owner: Andrew Bogott)
[20:53:31] (CR) Alex Monk: Add labtestwiki, a testing version of labswiki (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/264786 (owner: Andrew Bogott)
[20:54:59] (CR) Alex Monk: Add labtestwiki, a testing version of labswiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/264786 (owner: Andrew Bogott)
[20:55:35] Hey, is there anyone on the ops team that can help me out?
[20:55:54] operations, Analytics, Analytics-Cluster, EventBus, Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#1942024 (mobrovac)
[20:56:24] josephine: I'm not ops, but if you say what you need I may know who can help
[20:57:37] RoanKattouw: I created a google group for someone, but I keep getting a bounceback email every time I try to send a "Test" email to it :/
[20:58:33] josephine: Hmm, I think the person I'd ask about that is mutante but I'm not sure if he's around today
[20:59:04] andrewbogott is clearly active but I don't know if he knows anything about email stuff
[20:59:14] RoanKattouw: Yea, I thought it was a long shot since it's a holiday today.
[21:00:02] RoanKattouw: I'll just file a ticket on phab :)
[21:00:15] Yeah that works
[21:00:32] It's pretty quiet here in the office
[21:06:47] operations, Analytics, Analytics-Cluster, EventBus, Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#1942098 (Ottomata) > RESTBase is stateless per se, but it relies on Cassandra, for which a cross-DC repli...
[21:09:15] operations, Analytics, Analytics-Cluster, EventBus, Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#1942101 (Ottomata) If Services doesn't need events from eqiad to show up in codfw (and vice versa), then...
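[note] On the T123954 comments above: MirrorMaker is essentially a consumer/producer pair that tails one Kafka cluster and republishes into another, which is why the direction of replication matters here. A minimal sketch with the stock Kafka tooling of this era (property file names are placeholders; the consumer config points at the source cluster, the producer config at the destination):
    # Sketch: mirror every topic from the eqiad cluster into codfw.
    kafka-mirror-maker.sh \
        --consumer.config consumer-eqiad.properties \
        --producer.config producer-codfw.properties \
        --num.streams 2 \
        --whitelist '.*'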
[23:02:38] I don’t think this is related, but I am loving this error message: "GC cache entry 'labswiki:cirrussearch-morelikethis-settings:2' was on gc-list for 3602 seconds in /srv/mediawiki/php-1.27.0-wmf.9/includes/libs/objectcache/APCBagOStuff.php" [23:05:12] Ohh, wait [23:05:15] Krenair: it’s looking for labswiki on db1029.eqiad.wmnet instead of on silver [23:05:21] I might know what this is [23:05:31] 'extension1' => array( [23:05:31] '10.64.16.18' => 10, # db1029 [23:05:31] '10.64.16.20' => 20, # db1031 snapshot host [23:05:36] it's looking for one of those two servers [23:06:12] So maybe this is wmgEchoCluster [23:06:53] Yep. [23:06:59] krenair@tin:/srv/mediawiki-staging (master)$ mwscript eval.php labswiki [23:06:59] > var_dump( $wmgEchoCluster ); [23:06:59] bool(false) [23:07:00] Whereas: [23:07:05] krenair@silver:~$ mwscript eval.php labswiki [23:07:05] > var_dump( $wmgEchoCluster ); [23:07:05] string(10) "extension1" [23:07:18] running sync-common on silver [23:07:59] I am not home atm but i see you guys are in the thick of it [23:08:10] you got a page about wikitech chasemp? [23:08:15] chasemp: two unrelated things I think [23:08:31] Krenair: tools paged at the same time, but it was back up by the time I looked. [23:08:31] Okay [23:08:36] NFS [23:08:36] It's back up [23:09:16] I'm not clear on why that fixed it [23:10:00] Krenair: https://labtestwikitech.wikimedia.org/wiki/ is still saying ‘no wiki found’ but I expect that’s because the db is empty? [23:11:22] No [23:11:33] It's because I haven't figured out how to sync files to the test server yet [23:12:05] oh? I ran ‘sync common’ there already and it seemed ok [23:12:09] That error is about being unable to map the host properly to something in all.dblist [23:12:12] but the db is still a mess [23:13:29] heh, it doesn’t even have mysql-server installed [23:13:40] Huh. [23:14:03] Oh, I know why. [23:14:28] Remember when I said to ignore my comment about multiversion? [23:14:38] Well... I was right in the first place [23:15:22] what was wrong about my entry? [23:15:24] simple command on wikitech to make that server sync: [23:15:25] SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@labtestwikitech /srv/deployment/scap/scap/bin/sync-common [23:15:38] andrewbogott: you didn't add one [23:15:46] on tin* even [23:16:25] this is different from wikiversions.json I take it [23:16:35] yes [23:16:42] I'll upload the patch [23:16:48] ok, thanks [23:17:23] Krenair: this is unrelated to the ‘no database server’ thing right? [23:17:26] (that I already pushed, now it shows the DB error) [23:17:37] ok [23:17:38] PROBLEM - puppet last run on mc2009 is CRITICAL: CRITICAL: puppet fail [23:17:41] right, the no DB server error is because mysql is not installed [23:17:49] yeah, that seems bad [23:17:56] I guess it’s not puppetized properly on silver :( [23:18:03] surprise surprise :P [23:18:10] or it's from a different role [23:18:24] I guess that's the same thing, effectively [23:18:30] no, the roles match I think [23:19:23] andrewbogott, oh, before I forget: other puppet change needed is dealing with silver.yaml [23:19:32] so probably we need role::mariadb [23:21:22] (03PS1) 10Andrew Bogott: Allow deployer access to labtestweb [puppet] - 10https://gerrit.wikimedia.org/r/264894 [23:21:23] Krenair: Like ^ you mean? 
I suppose that works, although ideally the admin::groups and apache::logrotate::rotate from both hosts would be moved out into a common file
[23:22:24] and the node blocks in site.pp merged
[23:23:02] this box isn't going to == silver in the long run. I'm hoping to host horizon there too...
[23:23:12] oh, ok
[23:23:13] so I definitely don't want to merge the node definitions
[23:23:57] (PS1) Alex Monk: labtestwikitech.wikimedia.org -> labtestwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/264895
[23:24:20] (CR) Alex Monk: [C: 2] labtestwikitech.wikimedia.org -> labtestwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/264895 (owner: Alex Monk)
[23:24:28] hey
[23:24:31] andrewbogott: I got an NFS page
[23:24:38] is everything ok?
[23:24:52] (Merged) jenkins-bot: labtestwikitech.wikimedia.org -> labtestwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/264895 (owner: Alex Monk)
[23:24:54] seems ok except for things being fucking slow but that's business as usual
[23:24:57] ok
[23:25:43] YuviPanda: I think that nfs had a stutter which I haven't investigated.
[23:25:53] Alex and I are messing with wikitech but hopefully the 'broken' phase of that is over now
[23:25:57] ok
[23:26:18] (PS1) Andrew Bogott: Include role::mariadb on silver and labtestweb2001 [puppet] - https://gerrit.wikimedia.org/r/264896
[23:26:27] I think wikitech was fixed, btw
[23:26:43] !log krenair@tin Synchronized multiversion/MWMultiVersion.php: https://gerrit.wikimedia.org/r/264895 (duration: 00m 31s)
[23:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:30:16] well… I wish I could test 264896 with the puppet compiler
[23:30:28] otherwise who knows what will happen to silver
[23:31:22] why can't you test it with that?
[23:31:41] andrewbogott: did you get a recovery SMS for labs?
[23:31:45] for NFS
[23:32:00] YuviPanda: I got no sms at all
[23:32:01] there wasn't a recovery from icinga
[23:33:08] ok
[23:33:09] Krenair: because of https://puppet-compiler.wmflabs.org/1623/ which seems more like 'puppet compiler can't handle this' than 'actual error'
[23:33:11] but I'm still reading
[23:34:02] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=16&hoststatustypes=3&serviceprops=2097162&nostatusheader still shows NFS as critical
[23:34:07] Hm, actually, YuviPanda http://tools-checker.wmflabs.org/nfs/home
[23:34:25] and indeed if I go to http://tools-checker.wmflabs.org/ I get 502 Bad Gateway
[23:34:42] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.170 second response time
[23:34:45] andrewbogott: yeah, I just fixed it
[23:34:50] andrewbogott: tools checker was stuck
[23:34:57] did you just restart nginx?
uwsgi
[23:35:28] I know the cause too
[23:35:32] webservice gets stuck sometimes
[23:35:34] need to add timeouts
[23:36:29] Krenair: I think it's because we don't allow exported resources in labs
[23:36:57] k, I think this is beyond my level of puppet knowledge :p
[23:37:38] I've a fix coming up
[23:37:54] (CR) Andrew Bogott: [C: 2] Allow deployer access to labtestweb [puppet] - https://gerrit.wikimedia.org/r/264894 (owner: Andrew Bogott)
[23:38:10] (PS2) Andrew Bogott: Include role::mariadb on labtestweb2001 [puppet] - https://gerrit.wikimedia.org/r/264896
[23:39:45] (CR) Andrew Bogott: [C: 2] Include role::mariadb on labtestweb2001 [puppet] - https://gerrit.wikimedia.org/r/264896 (owner: Andrew Bogott)
[23:40:08] (PS1) Yuvipanda: tools: Setup timeouts for toolschecker [puppet] - https://gerrit.wikimedia.org/r/264898 (https://phabricator.wikimedia.org/T123987)
[23:41:27] (PS2) Yuvipanda: tools: Setup timeouts for toolschecker [puppet] - https://gerrit.wikimedia.org/r/264898 (https://phabricator.wikimedia.org/T123987)
[23:42:19] (CR) Yuvipanda: [C: 2 V: 2] tools: Setup timeouts for toolschecker [puppet] - https://gerrit.wikimedia.org/r/264898 (https://phabricator.wikimedia.org/T123987) (owner: Yuvipanda)
[23:43:28] RECOVERY - puppet last run on mc2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:44:37] andrewbogott: that should take care of the tools checker paging problem
[23:44:45] great, thanks!
[23:49:32] ok, I am doing something dumb… role::mariadb is included on labtestweb2001 but none of its packages are getting installed
[23:52:16] (PS2) Andrew Bogott: Added icinga check for 'showmount' on tools instances. [puppet] - https://gerrit.wikimedia.org/r/264049 (https://phabricator.wikimedia.org/T123588)
[23:53:49] (CR) Andrew Bogott: [C: 2] Added icinga check for 'showmount' on tools instances. [puppet] - https://gerrit.wikimedia.org/r/264049 (https://phabricator.wikimedia.org/T123588) (owner: Andrew Bogott)
[23:58:35] ok… YuviPanda do you have time to explain my mistake? labtestweb2001 includes role::mariadb which includes mariadb which includes mariadb::packages which installs mariadb-server-5.5
[23:58:41] …so why isn't that package on labtestweb2001?
[23:59:18] looking
[23:59:54] at least one reason is
[23:59:56] > dist => 'precise-wikimedia',
[23:59:59] and I bet you aren't on precise :)
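[note] A closing note on that last exchange: dist => 'precise-wikimedia' pins the mariadb packages to the precise-wikimedia apt component, so apt on a trusty host never sees an installable candidate, which would explain role::mariadb applying cleanly while installing nothing. A quick check on the affected host (sketch; standard apt tooling, the repo labels are assumptions based on the usual Wikimedia apt layout):
    # Sketch: "Candidate: (none)" means no enabled repo provides the package.
    apt-cache policy mariadb-server-5.5
    # Which -wikimedia components are actually enabled on this host?
    grep -r wikimedia /etc/apt/sources.list /etc/apt/sources.list.d/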