[00:05:46] (CR) BBlack: [C: 2] Wrap varnishkafka ganglia monitor with has_ganglia [puppet] - https://gerrit.wikimedia.org/r/219775 (https://phabricator.wikimedia.org/T103278) (owner: Thcipriani) [00:33:10] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [00:34:00] PROBLEM - puppet last run on mw2151 is CRITICAL puppet fail [00:39:03] i'll restart gitblit on antimony ...again [00:44:50] !log restarted gitblit on antimony again [00:44:55] Logged the message, Master [00:45:21] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60557 bytes in 0.110 second response time [00:51:30] RECOVERY - puppet last run on mw2151 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [01:24:04] operations, Traffic, HTTPS, Patch-For-Review: Decom www.$lang hostnames/redirects - https://phabricator.wikimedia.org/T102815#1386400 (BBlack) @Krinkle said elsewhere: > And by default the www.en.wikipedia entries are not included https://www.google.co.uk/search?q=site:wikipedia.org+intitle:%22ANAPR... [01:28:11] PROBLEM - RAID on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:29:40] PROBLEM - SSH on mw1106 is CRITICAL - Socket timeout after 10 seconds [01:29:40] PROBLEM - Apache HTTP on mw1106 is CRITICAL - Socket timeout after 10 seconds [01:29:51] PROBLEM - HHVM rendering on mw1106 is CRITICAL - Socket timeout after 10 seconds [01:30:01] PROBLEM - configured eth on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:30:01] PROBLEM - DPKG on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:30:10] PROBLEM - Disk space on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:30:11] PROBLEM - nutcracker port on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:30:20] PROBLEM - salt-minion processes on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:30:21] PROBLEM - dhclient process on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:30:21] PROBLEM - nutcracker process on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:30:41] PROBLEM - HHVM processes on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:31:11] RECOVERY - SSH on mw1106 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [01:31:32] RECOVERY - configured eth on mw1106 is OK - interfaces up [01:31:32] RECOVERY - DPKG on mw1106 is OK: All packages OK [01:31:40] RECOVERY - RAID on mw1106 is OK no RAID installed [01:31:40] RECOVERY - HHVM rendering on mw1106 is OK: HTTP OK: HTTP/1.1 200 OK - 71809 bytes in 7.608 second response time [01:31:41] RECOVERY - Disk space on mw1106 is OK: DISK OK [01:31:51] RECOVERY - nutcracker port on mw1106 is OK: TCP OK - 0.000 second response time on port 11212 [01:31:51] RECOVERY - dhclient process on mw1106 is OK: PROCS OK: 0 processes with command name dhclient [01:31:51] RECOVERY - salt-minion processes on mw1106 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:31:51] RECOVERY - nutcracker process on mw1106 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [01:32:20] RECOVERY - HHVM processes on mw1106 is OK: PROCS OK: 6 processes with command name hhvm [01:33:00] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.029 second response time [01:33:51] RECOVERY - puppet last run on mw1106 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [02:21:11] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [02:21:51] RECOVERY - Host mw1085 is UP: PING WARNING - Packet loss = 28%, RTA = 2.99 ms [02:26:58] !log l10nupdate Synchronized php-1.26wmf10/cache/l10n: (no message) (duration: 07m 27s) [02:27:11] Logged the message, Master [02:31:32] !log LocalisationUpdate completed (1.26wmf10) at 2015-06-22 
02:31:32+00:00 [02:31:36] Logged the message, Master [04:31:26] operations, Analytics, Discovery, MediaWiki-General-or-Unknown, and 5 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1386418 (Mattflaschen) This could also be seen as a replacement for some use cases of MW hooks and hook listeners. Right now, if the hook... [05:01:02] not sure if this is the right place to ask but how do i verify the page I am writing is not the same as another similarly named page [05:01:03] ? [05:02:24] (PS3) KartikMistry: Beta: Fix no-mt YAML syntax [puppet] - https://gerrit.wikimedia.org/r/219361 [05:11:23] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jun 22 05:11:22 UTC 2015 (duration 11m 21s) [05:11:27] Logged the message, Master [05:13:49] (PS1) KartikMistry: CX: Enable Apertium Machine Translation for Simple English [puppet] - https://gerrit.wikimedia.org/r/219779 (https://phabricator.wikimedia.org/T103067) [05:16:31] operations, Beta-Cluster, MediaWiki-extensions-GettingStarted: GettingStarted on Beta Cluster periodically loses its Redis index - https://phabricator.wikimedia.org/T100515#1386438 (Mattflaschen) [05:25:36] (CR) Mattflaschen: "Currently, I can use foreachwikiindblist for any Echo maint script without errors caused by it not being enabled on that wiki." [mediawiki-config] - https://gerrit.wikimedia.org/r/139326 (https://phabricator.wikimedia.org/T59375) (owner: Withoutaname) [05:33:49] (PS1) KartikMistry: CX: Set Hindi<->Urdu pairs to no-mt as defaults [puppet] - https://gerrit.wikimedia.org/r/219781 [05:41:20] PROBLEM - puppet last run on mw2061 is CRITICAL puppet fail [05:58:41] RECOVERY - puppet last run on mw2061 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:06:27] (CR) Giuseppe Lavagetto: [C: 1] "I'll complete this a bit as I need it to go further with my work." 
[puppet] - https://gerrit.wikimedia.org/r/219482 (owner: Rush) [06:30:51] PROBLEM - puppet last run on cp1056 is CRITICAL Puppet has 1 failures [06:31:41] PROBLEM - puppet last run on cp2001 is CRITICAL Puppet has 1 failures [06:32:40] PROBLEM - puppet last run on mw1065 is CRITICAL Puppet has 1 failures [06:33:31] PROBLEM - puppet last run on cp2014 is CRITICAL Puppet has 1 failures [06:33:40] PROBLEM - puppet last run on db2036 is CRITICAL Puppet has 1 failures [06:33:41] PROBLEM - puppet last run on mw2107 is CRITICAL puppet fail [06:33:51] PROBLEM - puppet last run on cp3008 is CRITICAL Puppet has 1 failures [06:34:21] PROBLEM - puppet last run on mw1235 is CRITICAL Puppet has 1 failures [06:34:50] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 1 failures [06:35:20] PROBLEM - puppet last run on mw2136 is CRITICAL Puppet has 1 failures [06:35:21] PROBLEM - YARN NodeManager Node-State on analytics1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:35:30] PROBLEM - puppet last run on mw2206 is CRITICAL Puppet has 1 failures [06:35:30] PROBLEM - puppet last run on mw2079 is CRITICAL Puppet has 1 failures [06:35:31] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures [06:35:31] PROBLEM - puppet last run on mw2134 is CRITICAL Puppet has 1 failures [06:35:31] PROBLEM - puppet last run on mw2096 is CRITICAL Puppet has 1 failures [06:35:31] PROBLEM - puppet last run on mw2097 is CRITICAL Puppet has 2 failures [06:35:31] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures [06:35:32] PROBLEM - puppet last run on mw2003 is CRITICAL Puppet has 1 failures [06:35:32] PROBLEM - puppet last run on mw1166 is CRITICAL Puppet has 1 failures [06:35:41] PROBLEM - puppet last run on mw1170 is CRITICAL Puppet has 2 failures [06:35:51] PROBLEM - puppet last run on mw2022 is CRITICAL Puppet has 1 failures [06:37:50] PROBLEM - Hadoop NodeManager on analytics1016 is CRITICAL: PROCS CRITICAL: 0 processes with command name 
java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [06:38:50] RECOVERY - YARN NodeManager Node-State on analytics1016 is OK YARN NodeManager analytics1016.eqiad.wmnet:8041 Node-State: RUNNING [06:43:03] (PS2) Giuseppe Lavagetto: lvs: Add definitions for conftool [puppet] - https://gerrit.wikimedia.org/r/219482 (owner: Rush) [06:46:00] RECOVERY - puppet last run on db2036 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:46:11] RECOVERY - puppet last run on mw2097 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:46:21] RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:46:31] RECOVERY - puppet last run on mw2022 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:46:40] RECOVERY - puppet last run on cp1056 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:41] RECOVERY - puppet last run on mw1235 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:46:41] RECOVERY - puppet last run on mw1065 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:47:13] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:47:40] RECOVERY - puppet last run on cp2001 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:47:41] RECOVERY - puppet last run on cp2014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:41] RECOVERY - puppet last run on mw2136 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:47:51] RECOVERY - puppet last run on mw2206 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:51] RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:51] RECOVERY - puppet last run on mw2113 is OK Puppet is 
currently enabled, last run 1 minute ago with 0 failures [06:47:51] RECOVERY - puppet last run on mw2134 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:51] RECOVERY - puppet last run on mw2096 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:48:00] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:00] RECOVERY - puppet last run on mw1166 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:00] RECOVERY - puppet last run on mw2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:01] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:22] RECOVERY - Hadoop NodeManager on analytics1016 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [06:51:21] RECOVERY - puppet last run on mw2107 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [07:14:21] PROBLEM - puppet last run on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:17:51] RECOVERY - puppet last run on mw1049 is OK Puppet is currently enabled, last run 22 minutes ago with 0 failures [07:42:31] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [07:44:53] (PS3) Giuseppe Lavagetto: lvs: Add definitions for conftool [puppet] - https://gerrit.wikimedia.org/r/219482 (owner: Rush) [07:52:41] (CR) Giuseppe Lavagetto: [C: 2 V: 2] conftool: various robustness fixes [software/conftool] - https://gerrit.wikimedia.org/r/219329 (owner: Giuseppe Lavagetto) [07:53:05] (CR) Giuseppe Lavagetto: [C: 2 V: 2] conftool: pep8 compliance, improve logging [software/conftool] - https://gerrit.wikimedia.org/r/219339 (owner: Giuseppe Lavagetto) [08:02:20] (CR) Giuseppe Lavagetto: [C: 2 V: 2] debian: new release [software/conftool] - https://gerrit.wikimedia.org/r/219343 (owner: Giuseppe Lavagetto) [08:08:50] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60558 bytes in 0.573 second response time [08:09:14] (PS1) Muehlenhoff: Blacklist kernel modules [puppet] - https://gerrit.wikimedia.org/r/219786 (https://phabricator.wikimedia.org/T102600) [08:09:57] (CR) jenkins-bot: [V: -1] Blacklist kernel modules [puppet] - https://gerrit.wikimedia.org/r/219786 (https://phabricator.wikimedia.org/T102600) (owner: Muehlenhoff) [08:15:31] (CR) Addshore: rsync wikidata json dumps to labs /public/dumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/215585 (https://phabricator.wikimedia.org/T100885) (owner: Addshore) [08:15:45] (PS4) Addshore: rsync wikidata json dumps to labs /public/dumps [puppet] - https://gerrit.wikimedia.org/r/215585 (https://phabricator.wikimedia.org/T100885) [08:16:00] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [08:20:10] <_joe_> we're still missing the phab bot it seems [08:23:01] PROBLEM - RAID on mw1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 
seconds. [08:23:01] PROBLEM - puppet last run on mw1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:24:41] RECOVERY - RAID on mw1020 is OK no RAID installed [08:24:42] RECOVERY - puppet last run on mw1020 is OK Puppet is currently enabled, last run 19 minutes ago with 0 failures [08:26:38] (CR) Alexandros Kosiaris: [C: 1] "Yup. https://github.com/puppetlabs/puppet/commit/30e115330b52d54dd47f6e117e0dd80f9f2e795a" [puppet] - https://gerrit.wikimedia.org/r/219600 (owner: Gage) [08:27:23] (PS5) Addshore: rsync wikidata json dumps to labs /public/dumps [puppet] - https://gerrit.wikimedia.org/r/215585 (https://phabricator.wikimedia.org/T100885) [08:27:33] (PS1) Ori.livneh: Use cronolog and logrotate to avoid Puppetmaster Apache reloads [puppet] - https://gerrit.wikimedia.org/r/219788 [08:31:50] PROBLEM - puppet last run on mw1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:32:41] PROBLEM - SSH on mw1020 is CRITICAL - Socket timeout after 10 seconds [08:33:10] PROBLEM - nutcracker port on mw1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:33:10] PROBLEM - Disk space on mw1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:33:30] PROBLEM - RAID on mw1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:33:32] PROBLEM - HHVM processes on mw1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:34:10] PROBLEM - Apache HTTP on mw1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.018 second response time [08:34:11] PROBLEM - HHVM rendering on mw1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.021 second response time [08:34:20] RECOVERY - SSH on mw1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [08:34:41] RECOVERY - nutcracker port on mw1020 is OK: TCP OK - 0.000 second response time on port 11212 [08:34:42] RECOVERY - Disk space on mw1020 is OK: DISK OK [08:35:02] RECOVERY - RAID on mw1020 is OK no RAID installed [08:35:10] RECOVERY - puppet last run on mw1020 is OK Puppet is currently enabled, last run 5 minutes ago with 0 failures [08:35:11] RECOVERY - HHVM processes on mw1020 is OK: PROCS OK: 6 processes with command name hhvm [08:44:11] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60558 bytes in 0.092 second response time [08:58:37] (PS2) Muehlenhoff: Blacklist kernel modules [puppet] - https://gerrit.wikimedia.org/r/219786 (https://phabricator.wikimedia.org/T102600) [09:23:11] !log upgrading Jenkins gearman plugin from 0.1.1 to latest master (f2024bd). Restarting Jenkins. [09:23:15] Logged the message, Master [09:45:52] Does anyone know how stuff at wikimedia.org/static/images/project-logos is controlled? See https://phabricator.wikimedia.org/T103296 [09:46:20] operations, Graphoid, Services: Confine Graphoid with firejail - https://phabricator.wikimedia.org/T103095#1386705 (mobrovac) @Yurik, could you provide a Graphoid URL that should return a valid PNG in deployment-prep so we can test? [09:50:01] <_joe_> addshore: not off the top of my head [09:58:11] PROBLEM - puppet last run on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:58:51] PROBLEM - SSH on mw1049 is CRITICAL - Socket timeout after 10 seconds [09:59:10] PROBLEM - RAID on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:59:10] PROBLEM - configured eth on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:00:01] PROBLEM - DPKG on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:00:30] PROBLEM - nutcracker port on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:00:30] PROBLEM - HHVM rendering on mw1049 is CRITICAL - Socket timeout after 10 seconds [10:00:31] PROBLEM - Apache HTTP on mw1049 is CRITICAL - Socket timeout after 10 seconds [10:00:41] PROBLEM - Disk space on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:00:41] PROBLEM - HHVM processes on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:00:41] PROBLEM - nutcracker process on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:00:41] PROBLEM - salt-minion processes on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:00:41] PROBLEM - dhclient process on mw1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:01:41] PROBLEM - RAID on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:01:50] RECOVERY - DPKG on mw1049 is OK: All packages OK [10:02:11] PROBLEM - SSH on mw1075 is CRITICAL - Socket timeout after 10 seconds [10:02:11] RECOVERY - nutcracker port on mw1049 is OK: TCP OK - 0.000 second response time on port 11212 [10:02:21] PROBLEM - puppet last run on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:02:30] PROBLEM - Apache HTTP on mw1075 is CRITICAL - Socket timeout after 10 seconds [10:02:42] PROBLEM - nutcracker process on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:02:51] PROBLEM - salt-minion processes on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:03:01] PROBLEM - configured eth on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:03:02] PROBLEM - HHVM rendering on mw1075 is CRITICAL - Socket timeout after 10 seconds [10:03:21] PROBLEM - Disk space on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:03:30] PROBLEM - DPKG on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:05:31] PROBLEM - nutcracker port on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:05:31] PROBLEM - dhclient process on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:05:51] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [10:05:51] RECOVERY - Disk space on mw1049 is OK: DISK OK [10:05:51] RECOVERY - salt-minion processes on mw1049 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:05:51] RECOVERY - HHVM processes on mw1049 is OK: PROCS OK: 6 processes with command name hhvm [10:05:51] RECOVERY - nutcracker process on mw1049 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [10:05:51] RECOVERY - SSH on mw1049 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [10:05:52] RECOVERY - dhclient process on mw1049 is OK: PROCS OK: 0 processes with command name dhclient [10:06:00] PROBLEM - HHVM processes on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:06:11] RECOVERY - RAID on mw1049 is OK no RAID installed [10:06:12] RECOVERY - configured eth on mw1049 is OK - interfaces up [10:07:01] RECOVERY - puppet last run on mw1049 is OK Puppet is currently enabled, last run 29 minutes ago with 0 failures [10:07:19] <_joe_> mh what the hell is going on? 
[10:07:21] RECOVERY - HHVM rendering on mw1049 is OK: HTTP OK: HTTP/1.1 200 OK - 70390 bytes in 0.129 second response time [10:07:30] RECOVERY - Apache HTTP on mw1049 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.028 second response time [10:08:54] (PS1) Lokal Profil: Add DCAT-AP for Wikibase [puppet] - https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) [10:08:57] <_joe_> oh, the good ol memory leak [10:11:42] (PS1) Muehlenhoff: Enable firejail for graphoid [puppet] - https://gerrit.wikimedia.org/r/219801 (https://phabricator.wikimedia.org/T103095) [10:20:01] PROBLEM - RAID on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:20:01] PROBLEM - DPKG on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:21:21] PROBLEM - puppet last run on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:21:40] RECOVERY - RAID on mw1107 is OK no RAID installed [10:21:41] RECOVERY - DPKG on mw1107 is OK: All packages OK [10:26:31] PROBLEM - puppet last run on ms-be1017 is CRITICAL Puppet has 1 failures [10:27:31] RECOVERY - salt-minion processes on mw1075 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:27:40] RECOVERY - configured eth on mw1075 is OK - interfaces up [10:27:50] RECOVERY - Disk space on mw1075 is OK: DISK OK [10:27:53] operations, Wikidata, Wikimedia-General-or-Unknown: Wikidata and Wikiversity logo 404ing on wikimedia.org - https://phabricator.wikimedia.org/T103296#1386822 (Addshore) [10:28:01] RECOVERY - DPKG on mw1075 is OK: All packages OK [10:28:01] RECOVERY - RAID on mw1075 is OK no RAID installed [10:28:11] RECOVERY - nutcracker port on mw1075 is OK: TCP OK - 0.000 second response time on port 11212 [10:28:20] RECOVERY - dhclient process on mw1075 is OK: PROCS OK: 0 processes with command name dhclient [10:28:21] RECOVERY - SSH on mw1075 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) 
[10:28:40] RECOVERY - HHVM processes on mw1075 is OK: PROCS OK: 6 processes with command name hhvm [10:28:40] RECOVERY - puppet last run on mw1075 is OK Puppet is currently enabled, last run 49 minutes ago with 0 failures [10:29:01] RECOVERY - nutcracker process on mw1075 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [10:29:23] operations, Wikidata, Wikimedia-General-or-Unknown: Wikidata and Wikiversity logo 404ing on wikimedia.org - https://phabricator.wikimedia.org/T103296#1386841 (Addshore) So they are stored in operations-mediawiki-config and the logos are present there https://github.com/wikimedia/operations-mediawiki-c... [10:30:07] _joe_: found them, operations/mediawiki-config in w/static/images/project-logos/ [10:30:30] the logos are there, guess they just fell off the server hosting them.... I guess ops need to poke something somewhere :) [10:31:31] <_joe_> addshore: is there a bug for this? [10:31:38] https://phabricator.wikimedia.org/T103296 [10:31:43] <_joe_> ok thanks [10:31:44] just added operations to it [10:31:55] <_joe_> right it's there [10:32:42] <_joe_> I'll take a look in a few [10:32:47] awesome :) [10:33:02] <_joe_> addshore: first I have to finish something else though [10:33:06] no worries :0 [10:33:07] :) [10:33:51] PROBLEM - puppet last run on mw1075 is CRITICAL puppet fail [10:38:41] RECOVERY - puppet last run on mw1107 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:14] operations, Traffic: Package/backport openssl 1.0.2 + nginx 1.7.x or higher - https://phabricator.wikimedia.org/T96850#1386863 (MoritzMuehlenhoff) nginx 1.9.2 is now in Debian unstable [10:54:00] (PS4) Giuseppe Lavagetto: lvs: Add definitions for conftool [puppet] - https://gerrit.wikimedia.org/r/219482 (owner: Rush) [10:54:29] operations, ops-eqiad, Database: Disk issue on db1028 - https://phabricator.wikimedia.org/T103230#1386892 (jcrespo) Log is full of self-correcting notices 
of 32-10: ``` seqNum: 0x00036853 Time: Mon Jun 22 09:24:25 2015 Code: 0x0000006e Class: 0 Locale: 0x02 Event Description: Corrected medium error d... [11:04:40] (PS1) Jcrespo: Depool es1001 for regular maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/219805 [11:06:30] <_joe_> !log restarting hhvm on the low-memory appservers (main and api) [11:06:35] Logged the message, Master [11:06:51] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.122 second response time [11:06:52] RECOVERY - HHVM rendering on mw1020 is OK: HTTP OK: HTTP/1.1 200 OK - 70418 bytes in 0.351 second response time [11:07:06] (CR) Jcrespo: [C: 2] Depool es1001 for regular maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/219805 (owner: Jcrespo) [11:08:01] ^I intend to apply the change now, asking because of the rolling restarts [11:08:55] <_joe_> jynus: well, wait 5 mins if possible [11:09:04] ok, np [11:09:15] <_joe_> or - better - I'll stop for now, and do the rest in ~ 1 hour [11:09:20] <_joe_> that's actually better [11:09:34] <_joe_> so we don't mass-restart everything at the same time [11:09:50] RECOVERY - HHVM rendering on mw1075 is OK: HTTP OK: HTTP/1.1 200 OK - 70410 bytes in 0.681 second response time [11:10:00] (in reallity is should not affect, but I just prefer to not have to changes ongoing at the same time) [11:10:07] <_joe_> exactly [11:10:09] *two [11:10:40] so that we know who to blame :) [11:10:41] <_joe_> jynus: green light from me [11:10:51] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [11:10:53] <_joe_> I'm done for now, bbl [11:12:09] !log jynus Synchronized wmf-config/db-eqiad.php: Depool es1001 (duration: 00m 13s) [11:12:14] Logged the message, Master [11:12:41] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 0 below the confidence bounds 
[11:12:50] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [11:13:13] do not like that [11:14:01] but it is coming from db1049, not the one I changed [11:16:12] and I think it is gone now [11:19:20] PROBLEM - puppet last run on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:31] RECOVERY - puppet last run on mw1075 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [11:19:40] operations, Wikidata, Wikimedia-General-or-Unknown: Wikidata and Wikiversity logo 404ing on wikimedia.org - https://phabricator.wikimedia.org/T103296#1386974 (Krenair) https://meta.wikimedia.org/w/index.php?title=Www.wikimedia.org_template/temp&diff=12487372&oldid=12374377 if copied to the real templa... [11:23:22] PROBLEM - RAID on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:23:30] PROBLEM - SSH on mw1107 is CRITICAL - Socket timeout after 10 seconds [11:25:11] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [11:25:11] PROBLEM - DPKG on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:51] PROBLEM - Apache HTTP on mw1107 is CRITICAL - Socket timeout after 10 seconds [11:26:11] PROBLEM - HHVM rendering on mw1107 is CRITICAL - Socket timeout after 10 seconds [11:26:52] RECOVERY - SSH on mw1107 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [11:27:00] RECOVERY - DPKG on mw1107 is OK: All packages OK [11:30:21] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 8 below the confidence bounds [11:32:11] PROBLEM - DPKG on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:34:01] PROBLEM - SSH on mw1107 is CRITICAL - Socket timeout after 10 seconds [11:34:11] PROBLEM - nutcracker port on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:34:11] PROBLEM - salt-minion processes on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:34:16] 1107 is quite dead (no salt answer) If someone can give it a look? [11:35:23] (but still responds to ping) [11:36:00] RECOVERY - salt-minion processes on mw1107 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:36:00] RECOVERY - nutcracker port on mw1107 is OK: TCP OK - 0.000 second response time on port 11212 [11:36:12] PROBLEM - configured eth on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:40:20] PROBLEM - dhclient process on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:40:20] PROBLEM - HHVM processes on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:40:41] PROBLEM - nutcracker process on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:41:30] PROBLEM - nutcracker port on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:41:30] PROBLEM - salt-minion processes on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:41:41] operations, Wikimedia-General-or-Unknown: [OPS] Puppet script for LaTeXML - https://phabricator.wikimedia.org/T56034#1387054 (hashar) Open>Resolved a:hashar Assuming it has been fixed. [11:44:11] PROBLEM - salt-minion processes on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:44:20] PROBLEM - nutcracker port on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:44:20] PROBLEM - HHVM rendering on mw1101 is CRITICAL - Socket timeout after 10 seconds [11:44:31] PROBLEM - Disk space on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:44:59] akosiaris: can you merge, https://gerrit.wikimedia.org/r/#/c/219361/ ? [11:45:00] PROBLEM - RAID on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:45:01] PROBLEM - puppet last run on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:45:11] PROBLEM - dhclient process on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:45:12] PROBLEM - HHVM processes on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:45:12] PROBLEM - DPKG on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:45:12] PROBLEM - nutcracker process on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:45:12] akosiaris: will let me to test before it goes to Prod. [11:45:40] PROBLEM - configured eth on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:45:41] PROBLEM - Apache HTTP on mw1101 is CRITICAL - Socket timeout after 10 seconds [11:45:41] RECOVERY - salt-minion processes on mw1101 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:46:01] PROBLEM - Disk space on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:46:10] RECOVERY - nutcracker port on mw1101 is OK: TCP OK - 0.000 second response time on port 11212 [11:46:29] mark, how many 240GB SSDs can you put per system in the 4 we get? [11:46:49] 4 max ssds per system, so 2 added per system [11:47:50] RECOVERY - Disk space on mw1107 is OK: DISK OK [11:50:00] (PS5) Hashar: mediawiki: update font packages for jessie [puppet] - https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: Dzahn) [11:50:28] (CR) Hashar: "Removed reference to tracking bug T95002 since there is a more specific task T102623" [puppet] - https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: Dzahn) [11:50:45] operations, Continuous-Integration-Infrastructure: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1387086 (hashar) [11:51:01] PROBLEM - SSH on mw1101 is CRITICAL - Socket timeout after 10 seconds [11:51:11] PROBLEM - salt-minion processes on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:51:21] PROBLEM - nutcracker port on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:01] PROBLEM - Disk space on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:54:21] mark, are those PCIx or SATA3? 1TB SATA3 drives start around $300. Which drives were you looking at? [11:55:02] RECOVERY - Disk space on mw1101 is OK: DISK OK [11:55:50] RECOVERY - HHVM processes on mw1101 is OK: PROCS OK: 6 processes with command name hhvm [11:55:51] RECOVERY - nutcracker process on mw1101 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [11:55:54] SATA [11:57:32] operations, Continuous-Integration-Infrastructure: Jessie does not have libmemcached10 - https://phabricator.wikimedia.org/T103315#1387102 (hashar) NEW [11:57:53] operations, Graphoid, Services, Patch-For-Review: Confine Graphoid with firejail - https://phabricator.wikimedia.org/T103095#1387110 (Yurik) You can use [[ http://en.wikipedia.beta.wmflabs.org/wiki/Special:PagesWithProp?propname=graph_specs&propname-other= | this link ]] to find graphs available on... [11:58:05] mobrovac, ^ [11:58:32] grazie yurik [11:58:58] operations, Continuous-Integration-Infrastructure: Jessie does not have libmemcached10 - https://phabricator.wikimedia.org/T103315#1387102 (hashar) [11:59:10] (CR) Alexandros Kosiaris: [C: 2] Beta: Fix no-mt YAML syntax [puppet] - https://gerrit.wikimedia.org/r/219361 (owner: KartikMistry) [12:00:31] PROBLEM - Disk space on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:00:50] what's with these alerts? [12:01:02] <_joe_> paravoid: OOMs [12:01:20] PROBLEM - HHVM processes on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:01:20] PROBLEM - nutcracker process on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:01:29] <_joe_> paravoid: I'm restarting HHVM on those servers [12:01:33] 6operations, 10Continuous-Integration-Infrastructure: Jessie does not have libvips15 - https://phabricator.wikimedia.org/T103322#1387163 (10hashar) 3NEW [12:01:54] OOMs why? [12:02:02] <_joe_> paravoid: after about 1 week of continuous operations, the smaller appservers will OOM. That is a week without restarting HHVM [12:02:30] <_joe_> or crashing, which happened much more pre-3.6 [12:03:08] 6operations, 10Continuous-Integration-Infrastructure: Investigate usage of ttf-ubuntu-font-familly which is not available on Jessie - https://phabricator.wikimedia.org/T103325#1387187 (10hashar) 3NEW [12:04:12] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 9 below the confidence bounds [12:04:58] 6operations: Investigate Ubuntu fork of ttf-indic-fonts and bring it in Jessie - https://phabricator.wikimedia.org/T103328#1387212 (10hashar) 3NEW [12:05:41] RECOVERY - Host ms-be1002 is UPING OK - Packet loss = 0%, RTA = 0.51 ms [12:05:50] 6operations, 10Traffic: Package/backport openssl 1.0.2 + nginx 1.7.x or higher - https://phabricator.wikimedia.org/T96850#1387242 (10BBlack) [12:05:53] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1387241 (10BBlack) [12:06:01] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [12:06:23] 6operations, 5Patch-For-Review: Mediawiki font packages: switch to Jessie - https://phabricator.wikimedia.org/T102623#1387250 (10hashar) [12:06:25] 6operations, 10Continuous-Integration-Infrastructure: Investigate usage of ttf-ubuntu-font-familly which is not available on Jessie - https://phabricator.wikimedia.org/T103325#1387251 (10hashar) [12:06:40] (03PS1) 10Faidon Liambotis: mediawiki: remove useless lib* dependencies [puppet] - 10https://gerrit.wikimedia.org/r/219810 [12:06:41] 
RECOVERY - High load average on ms-be1002 is OK - load average: 55.32, 19.29, 6.90 [12:06:42] 6operations, 10Continuous-Integration-Infrastructure: Investigate usage of ttf-ubuntu-font-familly which is not available on Jessie - https://phabricator.wikimedia.org/T103325#1387187 (10hashar) [12:06:45] 6operations, 10Continuous-Integration-Infrastructure: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1177707 (10hashar) [12:07:19] (03CR) 10Faidon Liambotis: [C: 032] mediawiki: remove useless lib* dependencies [puppet] - 10https://gerrit.wikimedia.org/r/219810 (owner: 10Faidon Liambotis) [12:07:31] RECOVERY - nutcracker port on mw1101 is OK: TCP OK - 0.000 second response time on port 11212 [12:07:31] RECOVERY - Disk space on mw1101 is OK: DISK OK [12:07:54] 6operations, 10Traffic: Support ALPN + HTTP/2 - https://phabricator.wikimedia.org/T96848#1387286 (10BBlack) [12:07:55] 6operations, 10Traffic: Evaluate limited caching inside nginx - https://phabricator.wikimedia.org/T96851#1387285 (10BBlack) [12:07:57] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1387287 (10BBlack) [12:07:59] 6operations, 10Traffic: Test then switch to openssl 1.0.2 + nginx 1.9.2 - https://phabricator.wikimedia.org/T96850#1387282 (10BBlack) 5stalled>3Open [12:07:59] !log powercycled ms-be1002, stuck at console [12:08:01] RECOVERY - RAID on mw1101 is OK no RAID installed [12:08:04] Logged the message, Master [12:08:10] RECOVERY - puppet last run on mw1101 is OK Puppet is currently enabled, last run 43 minutes ago with 0 failures [12:08:11] RECOVERY - dhclient process on mw1101 is OK: PROCS OK: 0 processes with command name dhclient [12:08:11] RECOVERY - nutcracker process on mw1101 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [12:08:11] RECOVERY - HHVM processes on mw1101 is OK: PROCS OK: 6 processes with command name 
hhvm [12:08:11] RECOVERY - DPKG on mw1101 is OK: All packages OK [12:08:19] _joe_: not sure I get it, why would unrelated nrpe checks hang? [12:08:23] _joe_: did the whole machine OOM? [12:08:30] <_joe_> yes [12:08:41] RECOVERY - configured eth on mw1101 is OK - interfaces up [12:08:49] why? [12:08:50] RECOVERY - SSH on mw1101 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [12:09:00] RECOVERY - salt-minion processes on mw1101 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:09:04] 6operations, 10Continuous-Integration-Infrastructure: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1387291 (10hashar) 3NEW [12:09:08] <_joe_> paravoid: or better, a shower of different services start oom'ing before hhvm dies [12:09:30] <_joe_> that's what I saw earlier on one appserver [12:09:32] 6operations, 10Continuous-Integration-Infrastructure: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1387311 (10faidon) [12:09:35] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Jessie does not have libvips15 - https://phabricator.wikimedia.org/T103322#1387308 (10faidon) 5Open>3Resolved a:3faidon The manifests were wrong to hardcode specific package names for libraries. [12:09:41] 6operations, 10Continuous-Integration-Infrastructure: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1177707 (10faidon) [12:09:44] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Jessie does not have libmemcached10 - https://phabricator.wikimedia.org/T103315#1387312 (10faidon) 5Open>3Resolved a:3faidon The manifests were wrong to hardcode specific package names for libraries. 
[12:10:12] 6operations, 10Continuous-Integration-Infrastructure: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1387325 (10hashar) I have created sub tasks, the fonts related ones being under T102623. Result is: * {T103315} * {T103322} * {T102623} ** {T103325... [12:12:50] PROBLEM - puppet last run on mw1098 is CRITICAL puppet fail [12:15:11] PROBLEM - puppet last run on mw1101 is CRITICAL puppet fail [12:17:17] (03PS1) 10Muehlenhoff: Enable firejail for citoid [puppet] - 10https://gerrit.wikimedia.org/r/219811 (https://phabricator.wikimedia.org/T98851) [12:18:30] PROBLEM - NTP on mw1107 is CRITICAL: NTP CRITICAL: No response from NTP server [12:19:30] RECOVERY - puppet last run on ms-be1017 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:19:39] 6operations, 6Discovery: Fix CirrusSearch monitoring - https://phabricator.wikimedia.org/T84163#1387380 (10faidon) Any news? [12:24:01] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [12:28:28] (03PS1) 10BBlack: enable ipsec for all codfw caches [puppet] - 10https://gerrit.wikimedia.org/r/219813 (https://phabricator.wikimedia.org/T81543) [12:28:37] (03PS1) 10JanZerebecki: Wikidata build: use deep copy instead of git submodule [puppet] - 10https://gerrit.wikimedia.org/r/219814 [12:30:40] RECOVERY - Disk space on mw1107 is OK: DISK OK [12:30:41] RECOVERY - nutcracker process on mw1107 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [12:30:50] RECOVERY - puppet last run on mw1098 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [12:31:02] RECOVERY - NTP on mw1107 is OK: NTP OK: Offset -0.001434087753 secs [12:31:02] RECOVERY - RAID on mw1107 is OK no RAID installed [12:31:10] RECOVERY - SSH on mw1107 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [12:31:10] RECOVERY - DPKG on mw1107 is OK: All packages OK [12:31:20] 
RECOVERY - salt-minion processes on mw1107 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:31:20] RECOVERY - nutcracker port on mw1107 is OK: TCP OK - 0.000 second response time on port 11212 [12:31:40] RECOVERY - configured eth on mw1107 is OK - interfaces up [12:31:51] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.187 second response time [12:32:10] RECOVERY - HHVM rendering on mw1107 is OK: HTTP OK: HTTP/1.1 200 OK - 70391 bytes in 0.151 second response time [12:32:10] RECOVERY - HHVM processes on mw1107 is OK: PROCS OK: 6 processes with command name hhvm [12:32:11] RECOVERY - dhclient process on mw1107 is OK: PROCS OK: 0 processes with command name dhclient [12:32:35] (03PS1) 10BBlack: enable ipsec for half eqiad text caches [puppet] - 10https://gerrit.wikimedia.org/r/219816 (https://phabricator.wikimedia.org/T81543) [12:32:37] (03PS1) 10BBlack: enable ipsec for all eqiad text caches [puppet] - 10https://gerrit.wikimedia.org/r/219817 (https://phabricator.wikimedia.org/T81543) [12:33:02] (03CR) 10BBlack: [C: 04-1] "On hold pending planning..." [puppet] - 10https://gerrit.wikimedia.org/r/219816 (https://phabricator.wikimedia.org/T81543) (owner: 10BBlack) [12:33:19] (03CR) 10jenkins-bot: [V: 04-1] enable ipsec for half eqiad text caches [puppet] - 10https://gerrit.wikimedia.org/r/219816 (https://phabricator.wikimedia.org/T81543) (owner: 10BBlack) [12:33:34] (03CR) 10BBlack: [C: 04-1] "On hold pending planning..." 
[puppet] - 10https://gerrit.wikimedia.org/r/219817 (https://phabricator.wikimedia.org/T81543) (owner: 10BBlack) [12:34:11] RECOVERY - puppet last run on mw1107 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [12:39:40] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure: Backport libjsch-java to Precise - https://phabricator.wikimedia.org/T103342#1387394 (10hashar) 3NEW a:3hashar [12:40:34] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Jenkins master / client ssh connection fails due to missing ssh algorithm - https://phabricator.wikimedia.org/T100509#1314411 (10hashar) [12:40:53] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure: Backport libjsch-java to Precise - https://phabricator.wikimedia.org/T103342#1387394 (10hashar) [12:42:34] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure: Backport libjsch-java to Precise - https://phabricator.wikimedia.org/T103342#1387418 (10hashar) a:5hashar>3None [12:42:39] 6operations, 6Discovery: Fix CirrusSearch monitoring - https://phabricator.wikimedia.org/T84163#1387421 (10Manybubbles) None. It's still on the list but the team has been concentrating on other things that have yet to finish. If you want a quick fix I'll +1 disabling this check and leaving this ticket to discu... 
[12:47:20] (03PS2) 10BBlack: enable ipsec for all eqiad text caches [puppet] - 10https://gerrit.wikimedia.org/r/219817 (https://phabricator.wikimedia.org/T81543) [12:47:22] (03PS2) 10BBlack: enable ipsec for half eqiad text caches [puppet] - 10https://gerrit.wikimedia.org/r/219816 (https://phabricator.wikimedia.org/T81543) [12:47:46] (03PS1) 10Giuseppe Lavagetto: small fixes, version bump in setup.py as well [software/conftool] - 10https://gerrit.wikimedia.org/r/219820 [12:47:50] 6operations, 10Continuous-Integration-Infrastructure, 7Jenkins: Please refresh Jenkins package on apt.wikimedia.org to 1.609.1 - https://phabricator.wikimedia.org/T103343#1387431 (10hashar) 3NEW [12:47:52] (03CR) 10BBlack: [C: 04-1] "On hold pending planning..." [puppet] - 10https://gerrit.wikimedia.org/r/219816 (https://phabricator.wikimedia.org/T81543) (owner: 10BBlack) [12:48:01] (03CR) 10BBlack: [C: 031] "On hold pending planning..." [puppet] - 10https://gerrit.wikimedia.org/r/219817 (https://phabricator.wikimedia.org/T81543) (owner: 10BBlack) [12:48:03] heh [12:48:10] (03CR) 10BBlack: [C: 04-1] "On hold pending planning..." [puppet] - 10https://gerrit.wikimedia.org/r/219817 (https://phabricator.wikimedia.org/T81543) (owner: 10BBlack) [12:49:02] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [12:50:46] (03CR) 10JanZerebecki: Default wmgUseWikibaseQuality on beta to true. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219630 (https://phabricator.wikimedia.org/T99351) (owner: 10JanZerebecki) [13:06:26] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] confd: include the confd class in confd::file, start by default [puppet] - 10https://gerrit.wikimedia.org/r/219821 (owner: 10Giuseppe Lavagetto) [13:08:31] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure: Backport libjsch-java to Precise - https://phabricator.wikimedia.org/T103342#1387481 (10hashar) @MoritzMuehlenhoff Thanks! 
lanthanum is the other CI Precise slave so we can get the package upgraded there. I am happy to see the o... [13:09:19] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure: Backport libjsch-java to Precise - https://phabricator.wikimedia.org/T103342#1387482 (10hashar) [13:17:40] 6operations, 6Research-and-Data, 7Database: Test and fix db1047 BBU - https://phabricator.wikimedia.org/T103345#1387500 (10jcrespo) 3NEW a:3jcrespo [13:22:38] _joe_: mw1101 seems still in trouble [13:23:45] 6operations, 10Wikidata, 10Wikimedia-General-or-Unknown: Wikidata and Wikiversity logo 404ing on wikimedia.org - https://phabricator.wikimedia.org/T103296#1387515 (10Krenair) Seems to have resolved itself..? [13:24:12] <_joe_> paravoid: yeah I'll take a look later, I'm pretty busy atm [13:25:07] or... perhaps it hasn't resolved itself [13:25:58] <_joe_> Krenair: pant, I promised I would take a look, will do [13:26:18] !log rebooting es1001 for regular maintenance [13:26:21] Logged the message, Master [13:26:25] is that related to mw1101? [13:26:48] <_joe_> nope [13:26:48] I only just logged in and have no idea what you guys are up to so don't let a few missing images distract you from important stuff [13:26:52] <_joe_> not at all [13:27:11] <_joe_> Krenair: but the reporter asked for ops assistance and I said I'll do that :) [13:27:17] ah [13:31:08] it seems to return either 404 or an actual image file depending on which machine I download it from, very strange [13:32:01] e.g. terbium gets an image and tin gets a 404? [13:32:47] locally I get an image, at college earlier I was getting 404, other people have reported 404s... 
[13:37:04] (03CR) 10Muehlenhoff: "citoid has been successfully tested on deployment-sca02, both using zotero:" [puppet] - 10https://gerrit.wikimedia.org/r/219811 (https://phabricator.wikimedia.org/T98851) (owner: 10Muehlenhoff) [13:43:08] 6operations, 10Analytics, 6Discovery, 10MediaWiki-General-or-Unknown, and 5 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1387533 (10Ottomata) BTW, T102082 is mainly about analytics eventlogging, but the confluent stuff would be good for an event bus used for app... [13:43:45] (03PS2) 10Giuseppe Lavagetto: small fixes, version bump in setup.py as well [software/conftool] - 10https://gerrit.wikimedia.org/r/219820 [13:49:38] (03PS5) 10Giuseppe Lavagetto: lvs: Add definitions for conftool [puppet] - 10https://gerrit.wikimedia.org/r/219482 (owner: 10Rush) [13:56:23] (03CR) 10Alexandros Kosiaris: [C: 031] Enable firejail for citoid [puppet] - 10https://gerrit.wikimedia.org/r/219811 (https://phabricator.wikimedia.org/T98851) (owner: 10Muehlenhoff) [14:03:50] (03PS6) 10Giuseppe Lavagetto: lvs: Add definitions for conftool [puppet] - 10https://gerrit.wikimedia.org/r/219482 (owner: 10Rush) [14:08:51] PROBLEM - puppet last run on mw2163 is CRITICAL puppet fail [14:12:28] (03Abandoned) 10Hashar: Revert "puppet/self: use the appropriate override" [puppet] - 10https://gerrit.wikimedia.org/r/219155 (https://phabricator.wikimedia.org/T102947) (owner: 10Hashar) [14:14:02] 6operations, 6Labs, 10Labs-Infrastructure, 10Wikimedia-Apache-configuration, 10wikitech.wikimedia.org: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#1387575 (10Andrew) Actually, I think the issue is that we store an entire new copy of wikitech on each import. I propose we chang... [14:19:55] 6operations, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, 6Multimedia, and 6 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1387583 (10demon) >>! 
In T102566#1385611, @Nemo_bis wrote: >> Surely we have to draw the line of where the oldest base sys... [14:20:09] (03PS1) 10Andrew Bogott: Only back up the current state of wikitech. [puppet] - 10https://gerrit.wikimedia.org/r/219827 (https://phabricator.wikimedia.org/T101803) [14:21:53] (03PS1) 10Hashar: Reenable sshd MAC/KEX hardening for Jenkins and Beta [puppet] - 10https://gerrit.wikimedia.org/r/219828 (https://phabricator.wikimedia.org/T100509) [14:22:01] (03PS2) 10Hashar: Reenable sshd MAC/KEX hardening for Jenkins and Beta [puppet] - 10https://gerrit.wikimedia.org/r/219828 (https://phabricator.wikimedia.org/T100509) [14:22:44] (03CR) 10Hashar: [C: 04-1 V: 04-1] "Updated libjsch-java is being backported to Precise T103342" [puppet] - 10https://gerrit.wikimedia.org/r/219828 (https://phabricator.wikimedia.org/T100509) (owner: 10Hashar) [14:25:14] (03PS2) 10KartikMistry: CX: Enable Apertium Machine Translation for Simple English [puppet] - 10https://gerrit.wikimedia.org/r/219779 (https://phabricator.wikimedia.org/T103067) [14:25:30] (03PS2) 10KartikMistry: CX: Set Hindi<->Urdu pairs to no-mt as defaults [puppet] - 10https://gerrit.wikimedia.org/r/219781 [14:26:22] (03PS1) 10Jcrespo: Repool es1001, depool es1002 for regular maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219830 [14:26:24] akosiaris: two more patches: https://gerrit.wikimedia.org/r/#/c/219779/ and https://gerrit.wikimedia.org/r/#/c/219781/ [14:26:27] :) [14:26:42] RECOVERY - puppet last run on mw2163 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:26:57] (I will wait for my maintenance) [14:28:25] (03PS2) 10Faidon Liambotis: puppet.conf: remove obsolete ca_md setting [puppet] - 10https://gerrit.wikimedia.org/r/219600 (owner: 10Gage) [14:28:31] (03CR) 10Faidon Liambotis: [C: 032 V: 032] puppet.conf: remove obsolete ca_md setting [puppet] - 10https://gerrit.wikimedia.org/r/219600 (owner: 10Gage) [14:29:00] (03CR) 10Alexandros Kosiaris: [C: 032] CX: 
Set Hindi<->Urdu pairs to no-mt as defaults [puppet] - 10https://gerrit.wikimedia.org/r/219781 (owner: 10KartikMistry) [14:29:16] (03PS3) 10Alexandros Kosiaris: CX: Enable Apertium Machine Translation for Simple English [puppet] - 10https://gerrit.wikimedia.org/r/219779 (https://phabricator.wikimedia.org/T103067) (owner: 10KartikMistry) [14:29:33] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] CX: Enable Apertium Machine Translation for Simple English [puppet] - 10https://gerrit.wikimedia.org/r/219779 (https://phabricator.wikimedia.org/T103067) (owner: 10KartikMistry) [14:30:09] (03PS3) 10Alexandros Kosiaris: CX: Set Hindi<->Urdu pairs to no-mt as defaults [puppet] - 10https://gerrit.wikimedia.org/r/219781 (owner: 10KartikMistry) [14:30:17] (03CR) 10Alexandros Kosiaris: [V: 032] CX: Set Hindi<->Urdu pairs to no-mt as defaults [puppet] - 10https://gerrit.wikimedia.org/r/219781 (owner: 10KartikMistry) [14:30:21] (03CR) 10Faidon Liambotis: [C: 04-1] "These packages do not exist in trusty (or precise) and our whole mw* fleet is running trusty. This would just break everywhere." [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: 10Dzahn) [14:32:22] kart_: Did you guys think about moving cxserver to service::node btw ? [14:32:27] !log restarting Jenkins [14:32:31] Logged the message, Master [14:34:14] (03CR) 10ArielGlenn: "Just a few comments, as I'm not the expert php dev." [puppet] - 10https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) (owner: 10Lokal Profil) [14:36:12] jynus: have a moment? [14:36:40] andrewbogott, yes [14:36:43] (03CR) 10Alex Monk: [C: 031] Only back up the current state of wikitech. [puppet] - 10https://gerrit.wikimedia.org/r/219827 (https://phabricator.wikimedia.org/T101803) (owner: 10Andrew Bogott) [14:37:46] jynus: I’m looking at https://phabricator.wikimedia.org/T101803. 
The issue is basically that wikitech-static’s drive is full because we’ve been syncing copy after copy after copy of wikitech up there. [14:38:09] I want to just wipe out all that history so we can begin again with a fresh, smaller sync. [14:38:31] (03PS2) 10Andrew Bogott: Only back up the current state of wikitech. [puppet] - 10https://gerrit.wikimedia.org/r/219827 (https://phabricator.wikimedia.org/T101803) [14:38:42] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [14:38:44] what is the catch? [14:38:51] jynus: but I’m not sure I know how to ‘wipe out all history’ without causing mysql to lose its mind. [14:39:01] ah [14:39:06] (03PS1) 10Faidon Liambotis: HTTPS: raise production's HSTS to 14 days [puppet] - 10https://gerrit.wikimedia.org/r/219833 [14:39:12] history is ON mysql [14:39:17] Partly I’m thrown by ibdata1 being so gigantic and not db-specific. [14:39:31] they are not really static copies [14:39:35] right? [14:40:17] Not really static, no — we just do a daily dump of wikitech and then import the entire dump to -static. [14:40:21] (03CR) 10BBlack: [C: 031] HTTPS: raise production's HSTS to 14 days [puppet] - 10https://gerrit.wikimedia.org/r/219833 (owner: 10Faidon Liambotis) [14:40:25] yep, innodb_file_per_table=1 [14:40:26] (03CR) 10Hashar: "Applied on beta cluster puppetmaster. deployment-bastion sshd_config had:" [puppet] - 10https://gerrit.wikimedia.org/r/219828 (https://phabricator.wikimedia.org/T100509) (owner: 10Hashar) [14:40:34] Previously the dump included /all/ history. I’ve just changed that to not include history so future syncs will be smaller. 
[14:40:37] then we should reload the whole thing [14:40:51] this is as easy as it is to bring down the services [14:41:00] if we can, it is easy [14:41:15] Oh yeah, an outage on wikitech-static is fine [14:41:22] no one looks there unless wikitech dies :) [14:41:48] let me investigate a bit and I will update the ticket with proposed course of action [14:41:57] thank you! [14:42:29] jynus: you’re also welcome to put your plan in action, at least on -static. It’s not puppetized, pretty much a seat-of-pants operation. [14:42:42] (03CR) 10Andrew Bogott: [C: 032] Only back up the current state of wikitech. [puppet] - 10https://gerrit.wikimedia.org/r/219827 (https://phabricator.wikimedia.org/T101803) (owner: 10Andrew Bogott) [14:43:00] yeah, an unmanaged db usually is very noticeable :) [14:43:33] I will try to fix that slowly with the trainings [14:43:59] jynus: -static is outside of our cluster — it’s docs of last resort in case of a meteor strike. [14:44:03] So having it unmanaged is a feature. [14:44:14] well, up to a point [14:44:29] yeah :) [14:44:32] not automatized != wrong config [14:45:01] agreed :) [14:45:38] I am not blaming you, I am blaming poor default config in older mysql [14:46:09] time for breakfast — back in a bit. [14:55:33] akosiaris: Thanks! [14:55:58] akosiaris: you mean service-runner? [14:56:17] jzerebecki: James_F ping for SWAT in ~4min [14:56:20] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Jenkins master / client ssh connection fails due to missing ssh algorithm - https://phabricator.wikimedia.org/T100509#1387695 (10hashar) [14:56:22] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure: Backport libjsch-java to Precise - https://phabricator.wikimedia.org/T103342#1387691 (10hashar) 5Open>3Resolved a:3hashar I have reenable the MAC/KEX on beta cluster but then: ``` fatal: no matching mac found: client: hmac-sh... [14:56:24] thcipriani: pong [14:56:25] Hrtr [14:56:30] Err. Here. 
[14:56:45] akosiaris: I'm working on updating apertium* packages for Debian. And will upload to Wheezy-backport once in testing. [14:56:57] That will remove big headache for me and you :) [14:57:13] Oh, wait. [14:57:17] I'm an idiot. [14:58:05] (it's not config :)) [14:58:28] Indeed. [14:58:36] Though arguably it should be. [14:58:40] Anyway. [14:58:44] 6operations, 10ops-codfw, 6Labs: rack and connect labstore-array4-codfw in codfw - https://phabricator.wikimedia.org/T93215#1387698 (10coren) This ticket has been open for some time; @papaul, can you confirm that this was done and that the current codfw setup mirrors the eqiad setup? Starting from a well-un... [14:58:56] James_F: could you make the backport into 1.26wmf10 and bump the submodule on core? [14:59:01] thcipriani: Doing it now. [14:59:05] James_F: Thanks! [14:59:07] Krenair: Want to +2 https://gerrit.wikimedia.org/r/#/c/219836/ too? [14:59:43] Thanks. [14:59:45] 6operations, 10ops-eqiad, 10Incident-20150401-LabsNFS-Overload: Inspect and diagnose labstore1001's H800 controler - https://phabricator.wikimedia.org/T95293#1387704 (10coren) [14:59:57] * James_F blames himself for not coming into work yesterday to check this. [15:00:22] kart_: service-runner and service::node go hand in hand, so yes :-) [15:00:31] kart_: I suppose you mean jessie-backports btw ? [15:00:38] huh, no bot today [15:00:44] * thcipriani begins SWAT [15:00:49] (03CR) 10Mobrovac: [C: 031] Allow optional firejail containment for nodejs services. [puppet] - 10https://gerrit.wikimedia.org/r/219177 (https://phabricator.wikimedia.org/T101870) (owner: 10Muehlenhoff) [15:00:53] (03PS9) 10Ottomata: Modify eventlogging module to use new changes to eventlogging server [puppet] - 10https://gerrit.wikimedia.org/r/210765 (https://phabricator.wikimedia.org/T98779) (owner: 10Milimetric) [15:01:04] kart_: that would be cool. 
Removes a blocker for moving cxserver and apertium to jessie hosts [15:01:04] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219630 (https://phabricator.wikimedia.org/T99351) (owner: 10JanZerebecki) [15:01:14] (03Merged) 10jenkins-bot: Default wmgUseWikibaseQuality on beta to true. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219630 (https://phabricator.wikimedia.org/T99351) (owner: 10JanZerebecki) [15:01:16] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Jenkins master / client ssh connection fails due to missing ssh algorithm - https://phabricator.wikimedia.org/T100509#1387723 (10hashar) [15:02:01] 6operations, 6Labs, 10Labs-Infrastructure, 10Wikimedia-Apache-configuration, and 2 others: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#1387730 (10jcrespo) The main issue with MySQL was that due to the `innodb_file_per_table | OFF` configuration (default in 5.5), every time a... [15:02:28] (03CR) 10Faidon Liambotis: "To my knowledge there are no other wikis that we maintain outside the main fleet + wikitech + wikitech-static." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219265 (https://phabricator.wikimedia.org/T103021) (owner: 10BBlack) [15:02:41] thcipriani: https://gerrit.wikimedia.org/r/219837 [15:02:58] James_F: awesome, thanks. [15:03:33] (03CR) 10Faidon Liambotis: [C: 031] ""install-console" is better, as the new_install key is also used for logging in to the target system (if puppet hasn't run yet/has been br" [puppet] - 10https://gerrit.wikimedia.org/r/217016 (owner: 10Filippo Giunchedi) [15:04:24] !log thcipriani Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: Default wmgUseWikibaseQuality on beta to true. 
[[gerrit:219630]] (duration: 00m 14s) [15:04:26] jouncebot, next [15:04:27] In 4 hour(s) and 55 minute(s): Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150622T2000) [15:04:28] jzerebecki: ^ probably already on labs though [15:04:29] Logged the message, Master [15:04:35] thcipriani, ^ [15:05:18] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1387742 (10Eevans) >>! In T78514#1385425, @GWicke wrote: > Metrics have been less than reliable recently. We have observed se... [15:06:17] jzerebecki: everything look good with you labs patch? [15:06:54] thcipriani: didn't have any effect :( [15:06:56] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure: Backport libjsch-java to Precise - https://phabricator.wikimedia.org/T103342#1387748 (10hashar) The trielad-ssh2 version is not the Debian package: ``` $ apt-cache search trilead libjenkins-trilead-ssh2-java - Trilead SSH2 implemen... [15:07:27] http://wikidata.beta.wmflabs.org/wiki/Special:Version should not contain an extension named: Wikibase Quality [15:07:57] 6operations, 10Continuous-Integration-Infrastructure: Backport libjsch-java to Precise - https://phabricator.wikimedia.org/T103342#1387394 (10hashar) [15:09:05] jzerebecki: hmm, well beta-scap-eqiad is still going [15:09:34] 6operations, 10ops-eqiad, 6Labs: Labs: Disconnect labstore1001 from the shelves - https://phabricator.wikimedia.org/T103355#1387786 (10coren) 3NEW [15:10:45] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1387799 (10GWicke) We have seen reporting stop on 2.1.3, 2.1.6 and 2.1.7. It seems to happen more quickly on 2.1.7 (often aft... 
[15:12:10] jzerebecki: I'm going to go ahead and deploy James_F stuff, we'll come back to beta if it's still not updated. I see the change on deployment-bastion, FWIW. [15:12:18] 6operations, 10ops-codfw: Labstore2001 controler or shelf failure - https://phabricator.wikimedia.org/T102626#1387801 (10coren) If it is not possible to get the two shelves working today (Jun 22), then please disconnect them from labstore2001 entirely as we need a stable system as destination for a backup. [15:12:28] that's fine [15:15:02] 6operations, 10Wikidata, 10Wikimedia-General-or-Unknown: Wikidata and Wikiversity logo 404ing on wikimedia.org - https://phabricator.wikimedia.org/T103296#1387814 (10Krenair) Or not. It seems to either be random or based on what machine I try to download the file from (e.g. I get an image on tin and terbium,... [15:15:11] 6operations, 10ops-codfw: Labstore2001 controler or shelf failure - https://phabricator.wikimedia.org/T102626#1387816 (10coren) [15:15:30] 6operations, 10ops-codfw: Labstore2001 controler or shelf failure - https://phabricator.wikimedia.org/T102626#1387817 (10coren) p:5Triage>3Unbreak! [15:16:00] akosiaris: err. yes :) [15:16:35] 6operations, 10ops-codfw, 6Labs: Labs: Install the new RAID controller in labstore2002 and test - https://phabricator.wikimedia.org/T103267#1387827 (10coren) [15:17:56] 6operations, 10ops-codfw, 6Labs: Labs: Install the new RAID controller in labstore2002 and test - https://phabricator.wikimedia.org/T103267#1385952 (10coren) This can be done safely with labstore2002 provided that it is first disconnected from the shelves. [15:18:25] 6operations, 10ops-eqiad, 6Labs: Labs: Disconnect labstore1001 from the shelves - https://phabricator.wikimedia.org/T103355#1387841 (10coren) p:5High>3Unbreak! 
[15:18:46] (03Abandoned) 10Hashar: Reenable sshd MAC/KEX hardening for Jenkins and Beta [puppet] - 10https://gerrit.wikimedia.org/r/219828 (https://phabricator.wikimedia.org/T100509) (owner: 10Hashar) [15:18:57] jzerebecki: looks like beta-scap-eqiad finally got your change out. [15:19:45] l10nupdate is not fast :) [15:19:51] PROBLEM - puppet last run on db2058 is CRITICAL puppet fail [15:19:52] thcipriani: yup works. thx. [15:19:59] (03PS1) 10ArielGlenn: make fetch/checkout report a little clearer [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/219841 [15:20:15] bd808: sometimes it is much faster, from checking the build history [15:20:50] it all depends on if the scap run actually catches any updated i18n json files or not [15:21:15] most of the time it doesn't and the l10n step is a quick check [15:21:39] but when the cache needs to be rebuilt it takes 7-10 minutes on deployment-bastion [15:24:45] !log thcipriani Synchronized php-1.26wmf10/extensions/WikiEditor: SWAT: Reduce 'Edit' EventLogging schema sampling rate to 6.25% (1/16th) [[gerrit:219837]] (duration: 00m 13s) [15:24:49] Logged the message, Master [15:24:54] PROBLEM - YARN NodeManager Node-State on analytics1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:25:04] ^ James_F not sure if you can check something like that, but if you can, please check :) [15:26:31] RECOVERY - YARN NodeManager Node-State on analytics1016 is OK YARN NodeManager analytics1016.eqiad.wmnet:8041 Node-State: RUNNING [15:27:28] (03PS1) 10Ottomata: Set default replication factor for kafka topics to min(3, size($brokers_array) [puppet] - 10https://gerrit.wikimedia.org/r/219842 [15:30:01] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [15:32:49] (03CR) 10Ottomata: [C: 032] Set default replication factor for kafka topics to min(3, size($brokers_array) [puppet] - 10https://gerrit.wikimedia.org/r/219842 (owner: 10Ottomata) [15:32:54] (03CR) 10Ori.livneh: [C: 031] HTTPS: raise production's HSTS to 14 days [puppet] - 10https://gerrit.wikimedia.org/r/219833 (owner: 10Faidon Liambotis) [15:36:09] (03PS1) 10ArielGlenn: report_minions: show minions known about in redis for repo [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/219845 [15:36:52] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60570 bytes in 0.388 second response time [15:37:30] RECOVERY - puppet last run on db2058 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:39:20] mutante_, do you know if something on the mail services would cause Precedence: Bulk to be added in https://phabricator.wikimedia.org/T103359 ? [15:40:12] I'm not sure the sending logic around that was changed recently... [15:40:34] 6operations, 10Wikidata, 10Wikimedia-General-or-Unknown: Wikidata and Wikiversity logo 404ing on wikimedia.org - https://phabricator.wikimedia.org/T103296#1387919 (10Joe) Ok, it seems related to the www.wikimedia.org apache config, in fact directly on an appserver I get: ``` $ curl -H 'Host: en.wikipedia.or... 
[15:42:07] (03CR) 10Eevans: [C: 031] Don't start cassandra on boot or via puppet [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/219503 (https://phabricator.wikimedia.org/T103134) (owner: 10GWicke) [15:42:19] 6operations, 10Wikidata, 10Wikimedia-General-or-Unknown: Wikidata and Wikiversity logo 404ing on wikimedia.org - https://phabricator.wikimedia.org/T103296#1387924 (10Joe) we still see the other logos because they've been cached, apparently. I'm not sure what changed here, to be honest. [15:42:40] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [15:43:08] thcipriani: FWIW it seems fine but yeah, can't really tell. [15:43:25] James_F: ok, thanks [15:47:24] (03CR) 10Giuseppe Lavagetto: "Did anyone check what will enable => false do with systemd?" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/219503 (https://phabricator.wikimedia.org/T103134) (owner: 10GWicke) [15:50:45] 6operations, 10ops-codfw, 6Labs: rack and connect labstore-array4-codfw in codfw - https://phabricator.wikimedia.org/T93215#1387954 (10Papaul) Stay waiting on Chris to give me the layout of labstore2001-array4 [15:52:25] papaul: what do you mean the layout? [15:53:05] how is labstore array4--connected to labstore-2001? [15:59:35] 6operations, 10Wikidata, 10Wikimedia-General-or-Unknown: Wikidata and Wikiversity logo 404ing on wikimedia.org - https://phabricator.wikimedia.org/T103296#1388009 (10Joe) Found it: ``` $ ls /srv/mediawiki/docroot/wwwportal w/ ``` there is no link to the static directory there, it must have been removed in... 
[15:59:51] 6operations, 6Release-Engineering, 10Wikidata, 10Wikimedia-General-or-Unknown: Wikidata and Wikiversity logo 404ing on wikimedia.org - https://phabricator.wikimedia.org/T103296#1388011 (10Joe) [16:00:14] <_joe_> thcipriani: see ^^ [16:00:28] <_joe_> this seems like something scapping or similar would solve [16:00:34] * thcipriani looks [16:01:03] (03PS1) 10Alexandros Kosiaris: Reorder bacula keypair key/certificate [puppet] - 10https://gerrit.wikimedia.org/r/219847 [16:04:27] 6operations, 6Labs, 3ToolLabs-Goals-Q4: Rename virt1000 to labcontrol1002, move to same subnet as labcontrol1001 - https://phabricator.wikimedia.org/T102646#1370417 (10Andrew) [16:04:46] _joe_: yeah, just needs a mediawiki-config patch then sync [16:05:35] (03PS10) 10Ottomata: Modify eventlogging module to use new changes to eventlogging server [puppet] - 10https://gerrit.wikimedia.org/r/210765 (https://phabricator.wikimedia.org/T98779) (owner: 10Milimetric) [16:05:53] (03CR) 10Ottomata: [C: 032 V: 032] Modify eventlogging module to use new changes to eventlogging server [puppet] - 10https://gerrit.wikimedia.org/r/210765 (https://phabricator.wikimedia.org/T98779) (owner: 10Milimetric) [16:07:21] !log deploying eventlogging 0.9. This includes changes for arbitrary eventlogging URIs in all eventlogging stages, as well as support for schema based kafka topic URIs. [16:07:26] Logged the message, Master [16:08:05] (03PS1) 10Andrew Bogott: Rename virt1000 to labcontrol1002. [puppet] - 10https://gerrit.wikimedia.org/r/219849 (https://phabricator.wikimedia.org/T102646) [16:08:16] !log disabling puppet on virt1000 [16:08:20] Logged the message, Master [16:08:48] _joe_: although looking at git it doesn't look like docroot/wwwportal/static was ever a directory :\ [16:09:46] <_joe_> thcipriani: uhm, then maybe chasemp has another possible smoking gun I guess [16:10:08] <_joe_> chasemp: which commit did you see?
[16:10:09] I don't unfortunately [16:11:23] are we running hhvm on jessie anywhere yet? [16:12:01] 6operations, 10ops-codfw: Labstore2001 controler or shelf failure - https://phabricator.wikimedia.org/T102626#1388102 (10Papaul) @Coren I think I was waiting on you to swap first the controller from labstore2001 [16:12:58] !log shutting down virt1000 [16:13:03] Logged the message, Master [16:13:14] * andrewbogott crosses fingers [16:14:52] 6operations, 10ops-eqiad, 6Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Rename virt1000 to labcontrol1002, move to same subnet as labcontrol1001 - https://phabricator.wikimedia.org/T102646#1388119 (10Andrew) [16:15:00] bd808: no [16:15:20] ori: thx. that will make my scap patch a bit easier I think [16:15:29] bd808: nope [16:15:47] (03PS2) 10Jcrespo: Repool es1001, depool es1002 for regular maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219830 [16:15:53] bd808: we do not even have a package [16:16:12] (03CR) 10Jcrespo: [C: 032] Repool es1001, depool es1002 for regular maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219830 (owner: 10Jcrespo) [16:16:16] cmjohnson1: Does https://phabricator.wikimedia.org/T102646 require a physical move of the server? And, if so, is there space? [16:16:29] I wasn't looking forward to writing an upstart/systemd command compat layer into the app [16:17:53] andrewbogott: yes [16:17:59] to both? [16:18:41] cmjohnson1: sounds good, do you need me to do anything else then? Should I write a dns patch as well? [16:19:05] andrewbogott: I can make space in c5 for it [16:19:12] cmjohnson1: great. [16:19:22] yeah, it will need a dns change [16:19:30] cmjohnson1: by the way, those virt100x ciscos are on the way out. So as soon as they’re in your way let me know and I’ll decommission. 
[16:19:48] okay, that's good to know [16:20:09] I don’t know what we’ll do with them after that… gotta write to Cisco and ask if they want ‘em back :) [16:20:26] yeah, i will have to contact them to see what they want to do [16:20:44] that rack might be a good candidate to go to 10G [16:20:51] oh! If you have a contact over there, please do! Or forward contact info to me and I’ll write to them. [16:20:58] we don't have any 10G in row B [16:21:22] i have some paperwork from when we got them and robh may have a contact as well [16:21:55] andrewbogott: will we need to schedule the move? [16:22:10] Are the HHVM .ini files in puppet? [16:22:15] i have no contact info, but keep me looped in on the results please =] [16:22:15] for virt1000? No, it’s powered down. [16:22:19] Lcawte: yeah [16:22:36] <_joe_> bd808: have you seen my comment? I think next week we'll have the possibility to depool servers easily from scap [16:22:37] okay..cool. You don't need to do anything with it yet. [16:22:51] <_joe_> Lcawte: not properly, they are just data structures in puppet [16:23:15] _joe_: awesome. It should be easy to plug that into what I've been working on [16:23:25] robh: hey :) [16:23:31] andrewbogott, ready for T101803 when I can 1) stop the service 2) delete all the database contents [16:23:37] <_joe_> bd808: it should, 4 lines of python I guess [16:23:41] <_joe_> maybe 5 [16:24:09] <_joe_> bd808: but you'd run the depool directly from the deployment server [16:24:43] robh: still on the kill bugzilla patch merges today? [16:24:47] *for the [16:24:54] _joe_: As in references to where they should be and what permissions they have? (that's all I could find for them) [16:24:58] _joe_: hmm.... will it be possible to do it from the cluster hosts themselves too? [16:25:10] If not that will take me a bit more work [16:26:05] 6operations, 10ops-eqiad: What to do with decommissioned ciscos?
- https://phabricator.wikimedia.org/T103374#1388150 (10Andrew) 3NEW a:3mark [16:26:26] <_joe_> bd808: you /could/, but I'd prefer not to [16:26:39] !log jynus Synchronized wmf-config/db-eqiad.php: Repool es1001, depool es1002 (duration: 00m 14s) [16:26:43] cmjohnson1: actually, I can’t write the dns patch until i know the new mgmt ip right? [16:26:44] <_joe_> bd808: you need to coordinate restarts anyways, right? [16:26:44] Logged the message, Master [16:27:05] <_joe_> I mean you need to leave some time between subsequent restarts in the same DC [16:27:14] <_joe_> or we're going to serve errors to clients [16:27:19] jynus: let me get set up to re-sync and then you can go ahead. [16:27:21] andrewbogott: no, but it will be easier if I do it [16:27:28] cmjohnson1: great :) [16:27:39] _joe_: yes, but the method for scap-like things today is to control that with batch sizes on the job run queue [16:28:12] The change I'd need is a way to run a local to the deploy server step as around advice for the remote job [16:28:21] andrewbogott, just ping me when you want today or this week, it should take me just a sec [16:28:28] which isn't unpossible at all [16:29:47] (03PS1) 10Ottomata: Install eventlogging on analytics1010 and configure it to process events for Kafka [puppet] - 10https://gerrit.wikimedia.org/r/219851 [16:29:52] jynus: go ahead. Let me know when we have an empty wiki and I’ll run the sync by hand. 
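The depool and restart exchange between _joe_ and bd808 above (restart in batches, leave time between subsequent restarts in the same DC so clients are not served errors) could be sketched roughly as follows. This is only an illustration under assumptions: `depool`, `restart` and `repool` are hypothetical callables supplied by the caller, not scap or conftool APIs, and the host names in the usage note are made up.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(hosts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(hosts), size):
        yield list(hosts[start:start + size])


def rolling_restart(
    hosts: Sequence[str],
    batch_size: int,
    depool: Callable[[str], None],   # hypothetical: take a host out of rotation
    restart: Callable[[str], None],  # hypothetical: restart the service on a host
    repool: Callable[[str], None],   # hypothetical: put a host back in rotation
    pause_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart hosts batch by batch, pausing between batches.

    Only one batch is out of rotation at any time, so the rest of the
    pool keeps serving traffic. Returns the hosts in processing order.
    """
    done: List[str] = []
    groups = list(batches(hosts, batch_size))
    for index, group in enumerate(groups):
        for host in group:
            depool(host)
        for host in group:
            restart(host)
        for host in group:
            repool(host)
        done.extend(group)
        # Pause between batches, but not after the final one.
        if index < len(groups) - 1:
            sleep(pause_seconds)
    return done
```

For example, `rolling_restart(["mw-a", "mw-b", "mw-c"], 2, depool, restart, repool, pause_seconds=30)` would handle `mw-a` and `mw-b` together, wait, then handle `mw-c`. Running the depool step from the deployment server, as _joe_ suggests, corresponds to calling this function there rather than on each cluster host.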
[16:30:43] (03CR) 10Ottomata: [C: 032] Install eventlogging on analytics1010 and configure it to process events for Kafka [puppet] - 10https://gerrit.wikimedia.org/r/219851 (owner: 10Ottomata) [16:31:15] !log reseting wikitech-static mysql contents to improve fragmentation [16:31:19] Logged the message, Master [16:31:27] <_joe_> bd808: ok, we can talk about it later I guess :) [16:31:59] yup [16:32:26] (03PS2) 10BryanDavis: Add HHVM restart support [tools/scap] - 10https://gerrit.wikimedia.org/r/219751 (https://phabricator.wikimedia.org/T103008) [16:32:43] (03CR) 10jenkins-bot: [V: 04-1] Add HHVM restart support [tools/scap] - 10https://gerrit.wikimedia.org/r/219751 (https://phabricator.wikimedia.org/T103008) (owner: 10BryanDavis) [16:33:10] (03PS5) 10Florianschmidtwelzow: Allow a full text search button on Commons whenever possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186916 (https://phabricator.wikimedia.org/T19471) (owner: 10Nemo bis) [16:33:37] (03PS1) 10ArielGlenn: git deploy cleanup to toss minion from redis [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/219852 [16:34:07] (03CR) 10Florianschmidtwelzow: "PS5 is a rebase and applied the suggestion of Dereckson :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186916 (https://phabricator.wikimedia.org/T19471) (owner: 10Nemo bis) [16:35:42] jynus: ready? [16:36:57] andrewbogott, I am doing it already [16:37:07] not finished yet [16:37:11] jynoh, you’re importing too? great [16:37:14] buh [16:37:18] no no [16:37:24] ok :) [16:37:25] I am changing some extra things [16:37:27] great [16:37:37] that is why it is taking me some extra seconds [16:37:51] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1388237 (10Niedzielski) @Dzahn, thanks! You're correct. My stat1003 configuration was bogus. stat1003 is good to go. Other than fixing my configuration, what do I... 
[16:40:04] ok, done, andrewbogott [16:40:12] kept an old copy on /tmp [16:40:19] I can delete if the import is ok [16:40:24] ok. I wonder how to make mediawiki create new dbs... [16:40:50] do you want me to create a wikitech db? [16:41:03] "mysqladmin create wikitech" [16:41:07] that is all that it takes [16:41:48] yeah, there are probably initial settings which are lost though... [16:41:57] (03PS1) 10Ottomata: Server side eventlogging forwarder needs count => true to generate seqids [puppet] - 10https://gerrit.wikimedia.org/r/219853 [16:42:07] I have the copy- I can recover those [16:42:12] (03CR) 10Ottomata: [C: 032 V: 032] Server side eventlogging forwarder needs count => true to generate seqids [puppet] - 10https://gerrit.wikimedia.org/r/219853 (owner: 10Ottomata) [16:42:49] jynus: yeah, there are probably things like wikiname, admin name/password [16:43:03] I bet that’s in a table that’s separate from the actual content. Can you check and restore it if so? [16:43:13] tell me the table/password, and I will [16:43:22] table/db I mean [16:43:40] Hm, no idea :) Lemme look on silver [16:45:08] (03PS1) 10BryanDavis: scap: allow mwdeploy to control Apache processes via sudo [puppet] - 10https://gerrit.wikimedia.org/r/219854 [16:45:44] 10Ops-Access-Requests, 6operations, 7LDAP: Request "wmf" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1388287 (10Niedzielski) (Sorry, reopening.) @Dzahn, would you mind leaving my niedzielski membership? It allows me to +2 in Gerrit which is part of certain Android... 
[16:45:52] (03PS3) 10BryanDavis: Add HHVM restart support [tools/scap] - 10https://gerrit.wikimedia.org/r/219751 (https://phabricator.wikimedia.org/T103008) [16:45:55] 10Ops-Access-Requests, 6operations, 7LDAP: Request "wmf" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1388288 (10Niedzielski) 5Resolved>3Open [16:47:08] (03CR) 10BryanDavis: Add HHVM restart support (034 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/219751 (https://phabricator.wikimedia.org/T103008) (owner: 10BryanDavis) [16:48:38] (03PS7) 10Giuseppe Lavagetto: lvs: Add definitions for conftool [puppet] - 10https://gerrit.wikimedia.org/r/219482 (owner: 10Rush) [16:50:25] I didn't delete the mysql database, BTW, mysql permissions should be ok [16:52:47] (03PS1) 10Ottomata: eventlogging::processor's output_invalid parameter is now smarter. Make kafka processor use this [puppet] - 10https://gerrit.wikimedia.org/r/219857 [16:53:33] let me recover the whole thing on the new configuration, and you can take it from there [16:54:59] jynus: that might be best, thanks. [16:55:37] (03CR) 10Ottomata: [C: 032] eventlogging::processor's output_invalid parameter is now smarter. Make kafka processor use this [puppet] - 10https://gerrit.wikimedia.org/r/219857 (owner: 10Ottomata) [16:56:26] (03CR) 10Giuseppe Lavagetto: [C: 031] "I have tested this by compiling it on one of the load balancers, and I see no effective change taking place in for ex. 
the pybal configura" [puppet] - 10https://gerrit.wikimedia.org/r/219482 (owner: 10Rush) [16:56:32] jynus: right now mediawiki is telling me ‘DB connection error: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock'' [16:57:01] yep, I have shut it down [16:57:06] to dump the old [16:57:22] and then I will bring back the new to import that logically [16:57:30] ok, seems happier now [16:57:34] it is a bit more steps than [16:57:38] well, sort of [16:57:40] I’ll stand back :) [16:57:49] if I could delete everything [17:01:30] 6operations, 6Release-Engineering, 10Wikidata, 10Wikimedia-General-or-Unknown: Wikidata and Wikiversity logo 404ing on wikimedia.org - https://phabricator.wikimedia.org/T103296#1388338 (10thcipriani) It doesn't look like a static link directory existed in mediawiki-config at `docroot/wwwportal` before,... [17:05:29] akosiaris: yt? [17:05:47] (03PS7) 10Dzahn: redirect old- to static-bugzilla [puppet] - 10https://gerrit.wikimedia.org/r/216734 (https://phabricator.wikimedia.org/T103190) (owner: 10John F. Lewis) [17:07:12] if the actual size of wikitech is more than 3 GB compressed, I will run out of space, andrewbogott [17:07:14] (03CR) 10Dzahn: [C: 032] redirect old- to static-bugzilla [puppet] - 10https://gerrit.wikimedia.org/r/216734 (https://phabricator.wikimedia.org/T103190) (owner: 10John F. Lewis) [17:07:47] ottomata: yup [17:08:10] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1388373 (10Ottomata) Do you need access to stat1002? If so, what do you need it for? See this doc for clarification: https://wikitech.wikimedia.org/wiki/Analyti... [17:08:32] akosiaris: can you check an analytics vlan ACL for me? [17:08:34] something is funky [17:08:45] i'm trying to connect to eventlog1001 on tcp ports 8421 and 8422 [17:08:51] jynus: worst case I can just start a fresh wiki.
I’m sure that’s easy, I just haven’t done it before. [17:08:53] 8522 works, but 8422 does not [17:10:23] ottomata: from any analytics host I presume [17:10:29] yes, [17:10:47] specifically am trying analytics1010 at the moment, but stat1002 also doesn't work [17:11:51] ottomata: ok. I concur. lemme check the ACL [17:12:30] danke [17:13:24] ottomata: yup. those ports are not allowed indeed [17:13:38] yw [17:13:45] 6operations, 10Traffic, 7HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1388392 (10BBlack) ok @faidon and I had a pages-long chat about this. I think we're going to step back and re-evaluate our options and best practices here instead of trying to... [17:14:22] (03PS2) 10Greg Grossmeier: make fetch/checkout report a little clearer [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/219841 (https://phabricator.wikimedia.org/T103013) (owner: 10ArielGlenn) [17:14:29] akosiaris: make them allowed please :) [17:14:33] 6operations, 10ops-eqiad, 6Labs: Labs: Disconnect labstore1001 from the shelves - https://phabricator.wikimedia.org/T103355#1388399 (10coren) [17:14:37] 6operations, 10ops-codfw, 6Labs: Labs: Install the new RAID controller in labstore2002 and test - https://phabricator.wikimedia.org/T103267#1388400 (10coren) [17:20:09] (03CR) 10Dzahn: [C: 032] switch old-bugzilla to apache cluster [dns] - 10https://gerrit.wikimedia.org/r/216736 (https://phabricator.wikimedia.org/T103190) (owner: 10John F. Lewis) [17:20:12] ottomata: sure. wanna file the task just for documentation's sake ? [17:20:31] sure [17:21:09] 6operations, 10ops-codfw: cp2024 console + disk issues - https://phabricator.wikimedia.org/T103090#1388425 (10Papaul) @Bblack. checked in shipping, they haven't received the drive yet. giving it onto after lunch. 
[17:21:47] akosiaris: https://phabricator.wikimedia.org/T103381 [17:21:49] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: Open tcp ports 8421 and 8422 to eventlog1001 to the Analytics VLAN - https://phabricator.wikimedia.org/T103381#1388430 (10Ottomata) a:5Ottomata>3akosiaris [17:21:57] JohnFLewis: uhh, possibly! I'm having trouble getting moving today so I didnt review those yet ;D [17:22:20] robh: mutante's taken it on so :) [17:22:41] (03CR) 10Ori.livneh: "Since cronolog will continuously write to the same file with the -o option, the thing to verify is that cronolog checks that its file hand" [puppet] - 10https://gerrit.wikimedia.org/r/219788 (owner: 10Ori.livneh) [17:22:46] if you want work - I think there is a mailman task or two ;) [17:23:37] PROBLEM - YARN NodeManager Node-State on analytics1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:23:37] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: Open tcp ports 8421 and 8422 to eventlog1001 to the Analytics VLAN - https://phabricator.wikimedia.org/T103381#1388451 (10akosiaris) 5Open>3Resolved Done and tested. Resolving [17:25:18] YESS thank you. [17:26:21] jynus: I’m lost — what’s happening now? Are you still restoring the old db? [17:26:21] no, I am dumping the old one [17:26:22] ‘dumping’ as in, deleting? Or as in… something else? [17:26:23] backuping [17:26:23] but I run out of space [17:26:50] ok. I think if you can give me a working mysql but with no dbs at all, I can create a fresh wiki and go from there. [17:27:06] well, that is what I had before [17:27:07] RECOVERY - YARN NodeManager Node-State on analytics1016 is OK YARN NodeManager analytics1016.eqiad.wmnet:8041 Node-State: RUNNING [17:27:44] jynus: yes, but I still think it’s probably the right approach. 
[17:27:57] you stopped mysql while I was still trying to set it up :) [17:28:07] ok, ok [17:28:14] I will put it back there [17:29:16] thanks [17:29:23] there you have, an empty mysql instance [17:29:37] old one is on /var/lib/mysql.old [17:29:48] delete it when you have something working [17:30:34] thanks [17:31:14] (03CR) 10GWicke: "@Giuseppe, puppet output of applying this in labs is:" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/219503 (https://phabricator.wikimedia.org/T103134) (owner: 10GWicke) [17:33:03] (03PS2) 10BBlack: HTTPS: raise production's HSTS to 14 days [puppet] - 10https://gerrit.wikimedia.org/r/219833 (owner: 10Faidon Liambotis) [17:33:20] jynus: ok, now it says ‘DB connection error: Access denied for user 'root'@'localhost' (using password: NO) ()’ when I try to create the new db tables. [17:33:23] (And I’m in a meeting now, sorry) [17:33:27] (03CR) 10BBlack: [C: 032] HTTPS: raise production's HSTS to 14 days [puppet] - 10https://gerrit.wikimedia.org/r/219833 (owner: 10Faidon Liambotis) [17:34:01] (03CR) 10BBlack: [V: 032] HTTPS: raise production's HSTS to 14 days [puppet] - 10https://gerrit.wikimedia.org/r/219833 (owner: 10Faidon Liambotis) [17:34:02] andrewbogott, you are not using a password, and I left root as it was before I used it, with a password [17:34:19] ah, I see. I’ve no idea what the password was, but I will dig when I have a chance. 
[17:34:25] (the one on .my.cnf) [17:34:51] I can delete if if you want, but I have not added anything it wasn't there before [17:35:41] you should not create a new install with the root user, though [17:37:13] ok [17:37:33] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-102: Locate and assign some MD1200 shelves for proper testing of labstore1002 - https://phabricator.wikimedia.org/T101741#1388543 (10faidon) [17:37:36] 6operations, 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-100, and 2 others: Migrate Labs NFS storage from RAID6 to RAID10 - https://phabricator.wikimedia.org/T96063#1388542 (10faidon) 5stalled>3Resolved [17:39:13] 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, 5Patch-For-Review: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1388553 (10faidon) [17:41:17] 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, 5Patch-For-Review: Backport sshd with AuthorizedKeysCommand support to Ubuntu precise - https://phabricator.wikimedia.org/T102401#1388562 (10faidon) [17:41:46] 6operations, 10ops-eqiad, 6Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Rename virt1000 to labcontrol1002, move to same subnet as labcontrol1001 - https://phabricator.wikimedia.org/T102646#1388565 (10faidon) [17:46:26] 6operations, 10Wikimedia-Bugzilla, 5Patch-For-Review: redirect old-bugzilla to static-bugzilla - https://phabricator.wikimedia.org/T103190#1388591 (10Dzahn) a:3Dzahn [17:47:20] 6operations, 5Patch-For-Review: Investigate the compatibility of our puppet tree with ruby2.1 and create a plan to upgrade - https://phabricator.wikimedia.org/T98129#1388593 (10Andrew) [17:48:16] PROBLEM - puppetmaster https on palladium is CRITICAL - Socket timeout after 10 seconds [17:49:17] !log Bugzilla has left the building [17:49:21] Logged the message, Master [17:49:58] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 1.455 second response time [17:51:14] 
6operations, 10Wikimedia-Bugzilla, 5Patch-For-Review: redirect old-bugzilla to static-bugzilla - https://phabricator.wikimedia.org/T103190#1388614 (10Dzahn) https://old-bugzilla.wikimedia.org/show_bug.cgi?id=1 [tin:~] $ apache-fast-test oldbz.urls mw1033 testing 2 urls on 1 servers, totalling 2 requests spa... [17:51:31] 6operations, 10Wikimedia-Bugzilla, 5Patch-For-Review: redirect old-bugzilla to static-bugzilla - https://phabricator.wikimedia.org/T103190#1388617 (10Dzahn) 5Open>3Resolved [17:51:32] 6operations, 10Wikimedia-Bugzilla: remove Bugzilla installation remnants from zirconium and repos - https://phabricator.wikimedia.org/T103193#1388618 (10Dzahn) [17:51:35] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1388619 (10Dzahn) [17:51:38] 6operations, 7discovery-system: Ensure alerts and notifications on confd failure modes - https://phabricator.wikimedia.org/T103360#1387901 (10chasemp) [17:52:21] (03PS1) 10Gilles: Enable TinyRGB ICC profile swapping on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219867 (https://phabricator.wikimedia.org/T100976) [17:52:33] 6operations, 7discovery-system: Ensure alerts and notifications on confd failure modes - https://phabricator.wikimedia.org/T103360#1387901 (10chasemp) [17:55:31] 6operations, 7discovery-system: Ensure alerts and notifications on confd failure modes - https://phabricator.wikimedia.org/T103360#1388642 (10chasemp) [17:58:06] 6operations, 7discovery-system: Ensure alerts and notifications on confd failure modes - https://phabricator.wikimedia.org/T103360#1388653 (10chasemp) [] **Confd service is not starting / running** Proposal: fix upstart configuration and check systems [] **Confd has invalid template or toml configuration**... 
[18:01:02] !log live-hacking mw1017 to debug T103053 [18:01:06] Logged the message, Master [18:01:28] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (-56 100000s) [18:13:37] 6operations, 5Continuous-Integration-Isolation: Figure out fine sudo rules for the nodepool service - https://phabricator.wikimedia.org/T102281#1388700 (10chasemp) [18:15:42] 6operations, 5Continuous-Integration-Isolation: Figure out fine sudo rules for the nodepool service - https://phabricator.wikimedia.org/T102281#1388702 (10chasemp) Let's not add #Ops-Access-Requests here as it flags this as a real needs review access request. Post this ticket can make some with whatever the o... [18:17:05] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1388705 (10Niedzielski) @Ottomata, @Dzahn, I just checked with my team and it sounds like stat1003 access is sufficient. Please consider this request resolved. Th... [18:22:38] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1388725 (10Dzahn) 5Open>3Resolved @Niedzielski alright, cool, thanks [18:30:14] 10Ops-Access-Requests, 6operations, 7LDAP: Request "wmf" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1388746 (10Dzahn) @Niedzielski I would prefer if we can just add the @wikimedia.org user into the WMF group, could you use that one on Gerrit or would that be a b... [18:30:40] Oh dear, getting a 503 on etherpad.wm.org [18:31:57] odder: yeah, it seems it's having issues [18:32:14] (03CR) 10Mobrovac: "Moronic question, but: @GWicke Puppet did not stop Cassandra because of enable => false, right?" 
[puppet/cassandra] - 10https://gerrit.wikimedia.org/r/219503 (https://phabricator.wikimedia.org/T103134) (owner: 10GWicke) [18:33:08] YuviPanda: Hola, can you think of any reason why tin would deny my new pubkey? changed in https://gerrit.wikimedia.org/r/#/c/212026/ [18:33:37] 10Ops-Access-Requests, 6operations, 7LDAP: Request "wmf" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1388767 (10Niedzielski) @Dzahn, it's not a problem but a strong preference. I have ~50 patches in Gerrit already under my niedzielski account and it's the same han... [18:33:56] (03CR) 10GWicke: "@mobrovac, it did not stop cassandra." [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/219503 (https://phabricator.wikimedia.org/T103134) (owner: 10GWicke) [18:39:49] 6operations, 10ops-codfw: Labstore2001 controler or shelf failure - https://phabricator.wikimedia.org/T102626#1388820 (10coren) Any testing we do should be to labstore2002 - we need 2001 up now for backup. Please try to fix the broken shelves and - if you can't find the issue for sure quickly enough - simply... [18:42:55] 6operations, 10ops-codfw, 6Labs: Labs: Install the new RAID controller in labstore2002 and test - https://phabricator.wikimedia.org/T103267#1388829 (10Papaul) All shelves are disconnect form labstore2002. New controller card in place. [18:45:01] (03PS1) 10Ori.livneh: Update my (=ori's) git / bash aliases [puppet] - 10https://gerrit.wikimedia.org/r/219894 [18:45:45] (03PS2) 10Ori.livneh: Update my (=ori's) git / bash aliases [puppet] - 10https://gerrit.wikimedia.org/r/219894 [18:46:03] (03CR) 10Ori.livneh: [C: 032 V: 032] Update my (=ori's) git / bash aliases [puppet] - 10https://gerrit.wikimedia.org/r/219894 (owner: 10Ori.livneh) [18:47:48] 6operations, 10ops-codfw: Labstore2001 controler or shelf failure - https://phabricator.wikimedia.org/T102626#1388871 (10Papaul) @Coren do you want me remove the old controller card or just leave it in there? 
[18:52:05] !log ori Synchronized php-1.26wmf10/includes/OutputPage.php: I0e5f2d3b2: Construct clean canonical URLs for wiki pages, ignoring request URL (T67402) (duration: 00m 14s) [18:52:10] Logged the message, Master [18:52:11] ^ bblack [18:53:43] https://en.wikipedia.org/?title=Modula-2 has the right canonical link now [18:59:00] thanks ori [18:59:13] that's great [18:59:53] greg-g: I haven't used Etherpad in months now, but is it normal to lose connection to it every couple of minutes or so? [19:00:04] odder: it was crashing, akosiaris was investigating [19:00:06] no, there's something going on [19:01:33] Well, it's quite unusable right now; I guess I'll come back later. [19:01:43] !log rebooting es1002 [19:01:47] Logged the message, Master [19:02:22] odder: greg-g https://github.com/ether/etherpad-lite/issues/2522 [19:02:31] chasing down the one pad that crashes etherpad [19:02:32] (03CR) 10Ori.livneh: [C: 04-1] "Thanks a ton for this. Looks good, couple of minor points." (035 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/219751 (https://phabricator.wikimedia.org/T103008) (owner: 10BryanDavis) [19:03:04] the fact that a serious bug is open for a long time now is not making me very optimistic [19:04:42] black hole bug ? [19:04:45] lol [19:06:30] damn, everyone is using our etherpad installation. and I mean everyone [19:06:51] from ppl exchanging love notes, to RPG players, to ppl keeping their diaries to pretty much anyone [19:07:03] LOL [19:07:09] lol [19:07:56] Love notes, srsly? :D [19:08:19] akosiaris, job well done! [19:09:02] MaxSem: ? [19:09:32] akosiaris, means we have the most stable EP installation on the webz [19:09:52] for unearthing a 3 month old "Serious Bug" ? Sigh... I hate etherpad [19:10:03] diaries? [19:10:11] love notes? [19:10:12] :) [19:10:16] they know that crap's public right? [19:10:19] and the one that doesn't crumble under 3 users' load [19:10:39] I think PiratePad is quite okay, too.
[19:11:04] apergos, add a disclaimer that all crap unrelated to WMF will be twitted publicly [19:11:27] then we'd get all the exhibitionists [19:12:03] MaxSem: hmm, never saw it from that perspective [19:12:22] so, WIN! :P [19:13:18] 6operations, 10ops-codfw: Labstore2001 controler or shelf failure - https://phabricator.wikimedia.org/T102626#1388990 (10coren) On 2001, leave the controller card as-is. We need the system in a predictable state. [19:16:32] hahaha [19:19:07] (03PS1) 10Ori.livneh: scap: require python-psutil and python-netifaces [puppet] - 10https://gerrit.wikimedia.org/r/219938 [19:19:25] (03PS2) 10Ori.livneh: scap: require python-psutil and python-netifaces [puppet] - 10https://gerrit.wikimedia.org/r/219938 [19:19:36] (03CR) 10Ori.livneh: [C: 032 V: 032] scap: require python-psutil and python-netifaces [puppet] - 10https://gerrit.wikimedia.org/r/219938 (owner: 10Ori.livneh) [19:19:49] ori: sweet that will kill a bunch of copied code [19:19:57] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 15.38% of data above the critical threshold [500.0] [19:20:08] 6operations, 10ops-eqiad, 6Labs: Labs: Disconnect labstore1001 from the shelves - https://phabricator.wikimedia.org/T103355#1388999 (10Cmjohnson) disconnected everything from lasbstore1001. [19:20:19] coren: do you want me to swap controller now? [19:20:26] wow, huge 5xx spike [19:20:41] cmjohnson1: Wait wait. Context. Ticket? [19:20:58] https://phabricator.wikimedia.org/T103355 [19:21:34] ori, db errors [19:22:28] cmjohnson1: Hm. That's disconnecting the shelves for a safe rebuild, but I'd rather we test the new controller in codfw before we mess with it in eqiad given we've only one egg left in our basket right now. We'll wait until after Wikimania. 
[19:22:40] okay [19:23:10] 6operations, 10ops-eqiad, 6Labs: Labs: Disconnect labstore1001 from the shelves - https://phabricator.wikimedia.org/T103355#1389014 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson done [19:23:26] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1389023 (10Cmjohnson) [19:23:28] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: Locate spare H800 PERC in case it is necessary to switch labstore1002's - https://phabricator.wikimedia.org/T101743#1389021 (10Cmjohnson) 5Open>3Resolved We have a spare card on-site but we ordered different cards. [19:23:33] cmjohnson1: ty [19:23:41] yw [19:24:06] PROBLEM - YARN NodeManager Node-State on analytics1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:25:08] 6operations, 10ops-eqiad, 10RESTBase: investigate restbase1007 sdb failure - https://phabricator.wikimedia.org/T102557#1389030 (10Cmjohnson) These disks were purchased separate from the servers. I will need to talk with @robh and see if we have any warranties. [19:26:00] 6operations, 10ops-codfw, 6Labs: Labs: Install the new RAID controller in labstore2002 and test - https://phabricator.wikimedia.org/T103267#1389035 (10coren) [19:26:04] jynus: as MaxSem noted, there was a huge spike of database errors a few minutes ago: https://gdash.wikimedia.org/dashboards/reqerror/ [19:26:19] https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor [19:26:21] 6operations, 10ops-eqiad, 10RESTBase: investigate restbase1007 sdb failure - https://phabricator.wikimedia.org/T102557#1389037 (10GWicke) The Samsung SSDs normally come with a 5-year warranty from Samsung itself. 
[19:26:28] Error connecting to 10.64.16.158: Too many connections [19:26:34] ^ jynus [19:26:36] ori, checking [19:26:37] RECOVERY - Host labstore2001 is UPING OK - Packet loss = 0%, RTA = 43.01 ms [19:26:43] thanks [19:26:52] that's pc1003 [19:27:01] think I got the offending PAD [19:27:01] yup [19:28:54] it's getting saturated [19:29:26] RECOVERY - YARN NodeManager Node-State on analytics1016 is OK YARN NodeManager analytics1016.eqiad.wmnet:8041 Node-State: RUNNING [19:29:48] (03PS1) 10coren: Switch labstore1001 to Jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/219956 [19:30:57] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [19:31:07] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100% [19:31:11] is memcache failing? [19:31:19] or did [19:34:46] !log delete pad:ips from etherpad [19:34:51] Logged the message, Master [19:39:43] halfak (cc: bblack): we noticed this with ori, http://grafana.wikimedia.org/#/dashboard/db/activity -- switch to last 30 days [19:40:09] also bblack: https://graphite.wikimedia.org/render/?title=navigationStart%20to%20loadEventEnd%20on%20desktop%20sites,%20last%20month&vtitle=milliseconds&from=-1month&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=1&lineMode=connected&target=alias%28color%28frontend.navtiming.totalPageLoadTime.desktop.overall.median,%22blue%22%29,%22median%22%29 [19:40:20] (but the mobile view is worse) [19:40:47] PROBLEM - Certificate expiration on nembus is CRITICAL: SSL CRITICAL - Certificate ldap-codfw.wikimedia.org valid until 2015-09-20 19:36:03 +0000 (expires in 89 days) [19:42:14] Quick review for https://gerrit.wikimedia.org/r/#/c/219956/ <-- simple pxe tweak [19:43:33] (03CR) 10Andrew Bogott: [C: 032] Switch labstore1001 to Jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/219956 (owner: 10coren) [19:43:47] 6operations, 10Beta-Cluster, 6Labs, 7Monitoring: Setup (simple) catchpoint monitoring for 
betacluster - https://phabricator.wikimedia.org/T97865#1389081 (10hashar) @yuvipanda can you handle replicating one of the catchpoint probe to hit en.wikipedia.beta.wmflabs.org ? Whatever is done for the production e... [19:44:17] 6operations, 10Beta-Cluster, 6Labs, 7Monitoring: Setup (simple) catchpoint monitoring for enwiki betacluster just like production - https://phabricator.wikimedia.org/T97865#1389084 (10hashar) [19:44:34] andrewbogott: ty. [19:44:51] Coren: I’m merging on carbon too, should be done in a few. [19:45:04] kk. [19:45:07] PROBLEM - Certificate expiration on neptunium is CRITICAL: SSL CRITICAL - Certificate ldap-eqiad.wikimedia.org valid until 2015-09-20 19:41:02 +0000 (expires in 89 days) [19:48:30] 6operations, 10ops-eqiad, 10Traffic: eqiad: investigate thermal issues with some cp10xx machines - https://phabricator.wikimedia.org/T103226#1389094 (10Cmjohnson) All the cp10xx had front bezels. I removed them to allow more airflow. [19:49:27] 10Ops-Access-Requests, 6operations, 5Continuous-Integration-Isolation: Get Dan Duvall TEMP root to labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T102133#1389095 (10chasemp) We talked about this in last weeks ops meeting. We are fine with Mr. Duvall in this context. [19:53:36] (03PS4) 10BryanDavis: Add HHVM restart support [tools/scap] - 10https://gerrit.wikimedia.org/r/219751 (https://phabricator.wikimedia.org/T103008) [19:54:19] paravoid: Re those graphs: whoa, what happened on 6/18? [19:54:51] probably https://gerrit.wikimedia.org/r/#/c/219101/ [19:54:56] (03PS1) 10Rush: Nodepool temporary roots [puppet] - 10https://gerrit.wikimedia.org/r/219959 [19:55:03] Maybe [19:55:05] but also possibly related to HTTPS/SPDY? unclear yet [19:55:08] When did the de-bits-ification thing happen? 
[19:55:09] (03CR) 10BryanDavis: Add HHVM restart support (033 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/219751 (https://phabricator.wikimedia.org/T103008) (owner: 10BryanDavis) [19:55:18] > 1 month I think [19:56:07] (03PS2) 10Rush: Nodepool temporary roots [puppet] - 10https://gerrit.wikimedia.org/r/219959 [19:57:24] Hmm, https://graphite.wikimedia.org/render/?title=navigationStart%20to%20loadEventEnd%20on%20desktop%20sites,%20last%20month&vtitle=milliseconds&from=-3month&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=1&lineMode=connected&target=alias%28color%28frontend.navtiming.totalPageLoadTime.desktop.overall.median,%22blue%22%29,%22median%22%29 [19:57:29] That does go up on 5/5 [19:57:40] But now we're way below old levels [19:57:47] yes [19:58:02] go back to 12-month [19:58:10] we are now lower than we have ever been [19:58:30] There's some data in January that I assume is bad? [19:58:38] probably [19:58:53] Also what's TTFE? Time To First .... Event? [19:59:05] where do you see that? [19:59:23] (03CR) 10Rush: [C: 032] Nodepool temporary roots [puppet] - 10https://gerrit.wikimedia.org/r/219959 (owner: 10Rush) [19:59:27] https://grafana.wikimedia.org/#/dashboard/db/activity at the bottom [19:59:37] PROBLEM - puppet last run on mc2005 is CRITICAL puppet fail [19:59:37] 6operations, 5Continuous-Integration-Isolation: Remove hashar and dduvall root access on to be installed labnodepool1001 - https://phabricator.wikimedia.org/T95303#1389157 (10chasemp) [19:59:50] time to first edit [20:00:04] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150622T2000). Please do the needful. [20:00:26] yeah, time from registration to first edit [20:00:41] i'm still unsure about its value as a metric [20:01:30] Do you remove blocked users from that? If not, it's probably heavily flawed by spam bots and sock puppets, ... 
[20:02:29] 6operations, 10Traffic, 7HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1389169 (10faidon) The more I think about it, the more a solution where we pin 3-5 keys (one primary RSA, one primary ECDSA, one backup but online where the primary lies, 1-2 o... [20:03:41] 6operations, 10ops-codfw: Labstore2001 controler or shelf failure - https://phabricator.wikimedia.org/T102626#1389172 (10Papaul) visual observation : all disks have green light on all the shelves,and all cable are attached to the shelves are in place. on H700 controller, it is showing 12 disks and the status... [20:04:41] Aah OK [20:05:50] 7Puppet, 3Mobile-Web, 5Patch-For-Review, 3Readership-Web-Next-Sprint-50-X______________: Certain urls do not redirect to mobile - https://phabricator.wikimedia.org/T103158#1389180 (10Jdlrobson) [20:06:17] 7Puppet, 3Mobile-Web, 5Patch-For-Review, 3Readership-Web-Next-Sprint-50-X______________: Certain urls do not redirect to mobile - https://phabricator.wikimedia.org/T103158#1383579 (10Jdlrobson) It looks like this is working on English Wikipedia. Would someone double check this and check redirect is working... [20:06:55] 10Ops-Access-Requests, 6operations, 5Continuous-Integration-Isolation: Get Dan Duvall TEMP root to labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T102133#1389184 (10chasemp) 5Open>3Resolved a:3chasemp {F687} [20:08:11] 7Puppet, 3Mobile-Web, 5Patch-For-Review, 3Readership-Web-Next-Sprint-50-X______________: Certain urls do not redirect to mobile - https://phabricator.wikimedia.org/T103158#1389194 (10faidon) 5Open>3Resolved a:3faidon Confirmed. 
[20:08:15] 10Ops-Access-Requests, 6operations, 5Continuous-Integration-Isolation: Get Dan Duvall TEMP root to labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T102133#1389198 (10chasemp) https://gerrit.wikimedia.org/r/#/c/219959/ [20:08:22] starting parsoid deploy [20:13:35] (03PS8) 10Rush: lvs: Add definitions for conftool [puppet] - 10https://gerrit.wikimedia.org/r/219482 [20:16:07] RECOVERY - puppet last run on mc2005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:21] 6operations, 6Release-Engineering, 10Wikidata, 10Wikimedia-General-or-Unknown: Wikidata and Wikiversity logo 404ing on wikimedia.org - https://phabricator.wikimedia.org/T103296#1389234 (10thcipriani) so it looks like if you revert to this revision: https://meta.wikimedia.org/w/index.php?title=Www.wikimedia... [20:21:59] 7Blocked-on-Operations, 6operations, 10Parsoid: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#1389253 (10cscott) This is still broken: ``` $ git deploy service restart Error received from salt; raw output: Failed to au... [20:22:35] akosiaris, apergos: `git deploy service restart` is still broken. :( ^^ [20:22:55] i did a manual `$ for wtp in `ssh bast1001.wikimedia.org cat /etc/dsh/group/parsoid` ; do echo $wtp ; ssh $wtp sudo service parsoid restart ; done` workaround for now. [20:23:27] PROBLEM - Host analytics1016 is DOWN: PING CRITICAL - Packet loss = 100% [20:25:14] 6operations, 6Release-Engineering, 10Wikidata, 10Wikimedia-General-or-Unknown: Wikidata and Wikiversity logo 404ing on wikimedia.org - https://phabricator.wikimedia.org/T103296#1389268 (10Krenair) >>! In T103296#1389234, @thcipriani wrote: > so it looks like if you revert to this revision: > https://meta.w... [20:25:41] cscott: thanks, still tracking it down from the previous report [20:26:01] I guess that git deploy fetch and checkout were ok? [20:27:54] apergos: yes. 
[20:28:06] all right, just making sure. [20:28:34] !log updated Parsoid to version d488783e [20:28:39] Logged the message, Master [20:29:45] apergos: i'm about to deploy ocg, which also has a `git deploy service restart` step. that's always worked for me in the past, we'll see if it still does so (if not, there's some sort of salt regression) [20:30:40] 6operations, 10Wikimedia-Etherpad, 7Database: Change character set on etherpad- lite database to utf8mb4_bin - https://phabricator.wikimedia.org/T103417#1389300 (10Krenair) [20:31:42] well if it doesn't that might be me having screwed up the restart function [20:31:43] we'll see [20:32:21] cscott: I will likely not be around by then but could you let me know the results (my shadow will be here in irc)? [20:34:27] (03CR) 10Rush: [C: 032] lvs: Add definitions for conftool [puppet] - 10https://gerrit.wikimedia.org/r/219482 (owner: 10Rush) [20:34:29] 6operations, 10Wikimedia-Etherpad, 7Database: Change character set on etherpad- lite database to utf8mb4_bin - https://phabricator.wikimedia.org/T103417#1389319 (10jcrespo) a:3jcrespo I checked and the request is legit. However, it is almost 23h here and the last thing I want to do now is run a pt-osc with... [20:36:23] (03CR) 10Ori.livneh: "Tiny tiny point, +2 otherwise" (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/219751 (https://phabricator.wikimedia.org/T103008) (owner: 10BryanDavis) [20:42:11] 10Ops-Access-Requests, 6operations, 5Continuous-Integration-Isolation: Get Dan Duvall TEMP root to labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T102133#1389388 (10dduvall) Thanks! 
[20:42:48] (03PS5) 10BryanDavis: Add HHVM restart support [tools/scap] - 10https://gerrit.wikimedia.org/r/219751 (https://phabricator.wikimedia.org/T103008) [20:43:35] (03CR) 10BryanDavis: Add HHVM restart support (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/219751 (https://phabricator.wikimedia.org/T103008) (owner: 10BryanDavis) [20:45:56] (03CR) 10Ori.livneh: [C: 032] Add HHVM restart support [tools/scap] - 10https://gerrit.wikimedia.org/r/219751 (https://phabricator.wikimedia.org/T103008) (owner: 10BryanDavis) [20:46:16] (03Merged) 10jenkins-bot: Add HHVM restart support [tools/scap] - 10https://gerrit.wikimedia.org/r/219751 (https://phabricator.wikimedia.org/T103008) (owner: 10BryanDavis) [20:47:27] 6operations, 10Deployment-Systems, 7HHVM, 5Patch-For-Review, 15User-Bd808-Test: Scap should restart HHVM - https://phabricator.wikimedia.org/T103008#1389415 (10chasemp) thanks @bd808 [20:48:41] (03PS3) 10BryanDavis: Move dsh group file names to config [tools/scap] - 10https://gerrit.wikimedia.org/r/219752 [20:49:03] (03CR) 10BryanDavis: "PS3 was a manual rebase" [tools/scap] - 10https://gerrit.wikimedia.org/r/219752 (owner: 10BryanDavis) [20:50:57] mutante: on the icinga warning for puppet on silver you wrote ‘270113' [20:51:00] What is 270113? [20:51:07] RECOVERY - DPKG on labstore1001 is OK: All packages OK [20:51:10] oh wait, that’s a comment id [20:51:14] hm, how do I find the actual comment? [20:52:05] I wonder if the icinga people have heard of this thing called a ‘link’? [20:52:38] andrewbogott: what is this link you speak of? [20:53:03] nope all webservices should be written as cgi in c andrewbogott [20:53:18] helpfully icinga shows a chat bubble when I hover over the ‘acknowledged’ graphic. 
The chat bubble is truncated by my browser window so I cannot read the text [20:54:41] (03CR) 10dschwen: [C: 031] Allow a full text search button on Commons whenever possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186916 (https://phabricator.wikimedia.org/T19471) (owner: 10Nemo bis) [20:56:03] (03CR) 10Ori.livneh: [C: 032] Move dsh group file names to config [tools/scap] - 10https://gerrit.wikimedia.org/r/219752 (owner: 10BryanDavis) [20:56:23] (03Merged) 10jenkins-bot: Move dsh group file names to config [tools/scap] - 10https://gerrit.wikimedia.org/r/219752 (owner: 10BryanDavis) [20:57:27] (03PS2) 10Ori.livneh: scap: allow mwdeploy to control Apache processes via sudo [puppet] - 10https://gerrit.wikimedia.org/r/219854 (owner: 10BryanDavis) [20:57:32] (03CR) 10Ori.livneh: [C: 032] scap: allow mwdeploy to control Apache processes via sudo [puppet] - 10https://gerrit.wikimedia.org/r/219854 (owner: 10BryanDavis) [20:57:46] (03CR) 10Ori.livneh: [V: 032] scap: allow mwdeploy to control Apache processes via sudo [puppet] - 10https://gerrit.wikimedia.org/r/219854 (owner: 10BryanDavis) [20:58:55] 6operations, 7discovery-system: Ensure alerts and notifications on confd failure modes - https://phabricator.wikimedia.org/T103360#1389456 (10chasemp) [20:59:06] was redis updated on beta? [20:59:12] (03PS1) 10Ori.livneh: add 'scap-test' dsh group, containing 5 hosts from eqiad and 5 from codfw [puppet] - 10https://gerrit.wikimedia.org/r/219980 [21:01:23] looks like redis got updated to 2.8.x in beta? but not in production? [21:02:57] PROBLEM - Disk space on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:03:29] PROBLEM - SSH on mw1101 is CRITICAL - Socket timeout after 10 seconds [21:03:29] PROBLEM - salt-minion processes on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:03:47] PROBLEM - DPKG on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:03:58] PROBLEM - HHVM processes on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:04:18] PROBLEM - RAID on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:04:23] 7Puppet, 10MediaWiki-extensions-NavigationTiming, 6Performance-Team, 5Patch-For-Review, and 2 others: Track state (region) - https://phabricator.wikimedia.org/T101819#1389508 (10Jdforrester-WMF) [21:04:28] PROBLEM - configured eth on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:04:44] anyone on mw1101 ? [21:06:18] heheh, bblack [21:06:28] how do I invalidate a cached page in varinsh? [21:06:30] can't reach it via ssh or anything [21:06:34] curl -H 'Host: datasets.wikimedia.org' http://cp1043.eqiad.wmnet/aggregate-datasets/search/app_event_counts.tsv | wc -l [21:06:34] 721 [21:06:45] curl -H 'Host: datasets.wikimedia.org' http://cp1044.eqiad.wmnet/aggregate-datasets/search/app_event_counts.tsv | wc -l 598 [21:06:57] the two misc varnish hosts have different versions of this page [21:06:57] PROBLEM - dhclient process on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:07:02] and that is breaking some search dashboards [21:07:08] PROBLEM - nutcracker port on mw1101 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
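The two curl calls above check the same URL on each misc varnish host and compare line counts (721 vs 598). A minimal sketch of that comparison, assuming plain HTTP access to the cache hosts as in the log; the helper names `fetch_from_cache`, `line_counts` and `group_by_content` are made up for illustration and are not part of any existing tool:

```python
import hashlib
import urllib.request
from collections import defaultdict


def fetch_from_cache(cache_host, site, path, timeout=10):
    """Fetch `path` straight from one cache host, sending `site` as the Host header
    (the same idea as `curl -H 'Host: ...' http://cache_host/path`)."""
    req = urllib.request.Request(f"http://{cache_host}{path}", headers={"Host": site})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def line_counts(bodies):
    """Count newline characters per host, like piping each response through `wc -l`."""
    return {host: body.count(b"\n") for host, body in bodies.items()}


def group_by_content(bodies):
    """Group hosts by the SHA-256 of the body they served.
    More than one group means the caches disagree about the object."""
    groups = defaultdict(list)
    for host, body in bodies.items():
        groups[hashlib.sha256(body).hexdigest()].append(host)
    return sorted(sorted(hosts) for hosts in groups.values())


# Hypothetical usage (needs network access to the cache hosts, so left commented out):
# path = "/aggregate-datasets/search/app_event_counts.tsv"
# hosts = ["cp1043.eqiad.wmnet", "cp1044.eqiad.wmnet"]
# bodies = {h: fetch_from_cache(h, "datasets.wikimedia.org", path) for h in hosts}
# print(line_counts(bodies), group_by_content(bodies))
```

Hashing the body catches differences that a line count alone would miss; it does not say which copy is the stale one, only that the hosts disagree.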
[21:07:14] console shows nothing meaningful [21:07:18] PROBLEM - nutcracker process on mw1101 is CRITICAL: Timeout while attempting connection [21:07:27] !log rebooting mw1101 the hard way [21:07:28] 7Puppet, 6Community-Liaison, 10MediaWiki-extensions-NavigationTiming, 6Performance-Team, and 4 others: Track state (region) - https://phabricator.wikimedia.org/T101819#1389517 (10awight) [21:07:31] Logged the message, Master [21:09:37] RECOVERY - HHVM processes on mw1101 is OK: PROCS OK: 12 processes with command name hhvm [21:09:48] RECOVERY - RAID on mw1101 is OK no RAID installed [21:10:07] 6operations, 10ops-eqiad, 10RESTBase: investigate restbase1007 sdb failure - https://phabricator.wikimedia.org/T102557#1389538 (10fgiunchedi) @cmjohnson thanks! Also I'd like to try swapping disks on say restbase1008 and see if the failures follow the disk [21:10:07] RECOVERY - configured eth on mw1101 is OK - interfaces up [21:10:19] RECOVERY - Disk space on mw1101 is OK: DISK OK [21:10:37] RECOVERY - dhclient process on mw1101 is OK: PROCS OK: 0 processes with command name dhclient [21:10:49] RECOVERY - nutcracker port on mw1101 is OK: TCP OK - 0.000 second response time on port 11212 [21:10:58] RECOVERY - Apache HTTP on mw1101 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [21:10:58] RECOVERY - SSH on mw1101 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [21:10:59] RECOVERY - salt-minion processes on mw1101 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:10:59] RECOVERY - nutcracker process on mw1101 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [21:11:09] RECOVERY - DPKG on mw1101 is OK: All packages OK [21:11:47] RECOVERY - HHVM rendering on mw1101 is OK: HTTP OK: HTTP/1.1 200 OK - 71011 bytes in 0.156 second response time [21:13:38] (03PS1) 10Ori.livneh: Add silver to mw_appserver_networks, so it can be a deployment target [puppet] - 
10https://gerrit.wikimedia.org/r/219982 [21:13:44] andrewbogott: ^ [21:13:57] (03CR) 10Ori.livneh: [C: 032] add 'scap-test' dsh group, containing 5 hosts from eqiad and 5 from codfw [puppet] - 10https://gerrit.wikimedia.org/r/219980 (owner: 10Ori.livneh) [21:14:08] RECOVERY - puppet last run on mw1101 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:18:33] (03PS2) 10Andrew Bogott: Add silver to mw_appserver_networks, so it can be a deployment target [puppet] - 10https://gerrit.wikimedia.org/r/219982 (owner: 10Ori.livneh) [21:18:53] ori: v4 was already in there, so presumably it’s a v6 issue ^ [21:19:20] andrewbogott: ah, good catch. [21:19:35] (03PS1) 10Ottomata: Add cron to import eventlogging data from Kafka via Camus [puppet] - 10https://gerrit.wikimedia.org/r/219986 [21:19:46] (03PS3) 10Andrew Bogott: Add silver to mw_appserver_networks, so it can be a deployment target [puppet] - 10https://gerrit.wikimedia.org/r/219982 (owner: 10Ori.livneh) [21:20:07] !log Depooled mw1170-mw1175 and mw1270-mw1275 for testing Idddcfe46 [21:20:11] Logged the message, Master [21:20:52] (03CR) 10Ottomata: [C: 032] Add cron to import eventlogging data from Kafka via Camus [puppet] - 10https://gerrit.wikimedia.org/r/219986 (owner: 10Ottomata) [21:21:55] (03PS4) 10Andrew Bogott: Add silver to mw_appserver_networks, so it can be a deployment target [puppet] - 10https://gerrit.wikimedia.org/r/219982 (https://phabricator.wikimedia.org/T103138) (owner: 10Ori.livneh) [21:22:45] (03CR) 10Andrew Bogott: [C: 032] Add silver to mw_appserver_networks, so it can be a deployment target [puppet] - 10https://gerrit.wikimedia.org/r/219982 (https://phabricator.wikimedia.org/T103138) (owner: 10Ori.livneh) [21:23:10] (03PS5) 10Andrew Bogott: Add silver to mw_appserver_networks, so it can be a deployment target [puppet] - 10https://gerrit.wikimedia.org/r/219982 (https://phabricator.wikimedia.org/T103138) (owner: 10Ori.livneh) [21:24:56] 6operations, 10RESTBase, 7Monitoring, 
5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1389610 (10GWicke) @eevans, https://github.com/dropwizard/metrics/blob/v2.2.0/metrics-graphite/src/main/java/com/yammer/metri... [21:25:21] andrewbogott: remember that puppet needs to run on tin for that to work [21:25:53] yep, doing [21:25:54] 6operations, 6Labs, 10Labs-Infrastructure, 10Wikimedia-Apache-configuration, and 2 others: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#1389616 (10Andrew) 5Open>3Resolved a:3Andrew OK -- all of the above is done. I've restarted with empty tables, and now we're only syncing... [21:28:28] PROBLEM - puppet last run on tin is CRITICAL puppet fail [21:28:39] heh [21:28:50] Puppet also needs to work on tin for it to work, I'd imagine [21:30:19] RECOVERY - puppet last run on tin is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [21:30:26] 6operations, 10Wikimedia-Bugzilla: old-bugzilla redirects broken - https://phabricator.wikimedia.org/T103425#1389663 (10Legoktm) 3NEW a:3Dzahn [21:32:09] mutante: ^ I'm assuming that should be assigned to you? [21:32:28] RECOVERY - puppet last run on silver is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [21:32:45] 6operations, 6Labs, 6Release-Engineering, 10wikitech.wikimedia.org, 5Patch-For-Review: silver / scap - Could not get latest version: 403 Forbidden - https://phabricator.wikimedia.org/T103138#1389685 (10Andrew) 5Open>3Resolved a:3Andrew Fixed by attached patch. [21:32:53] yeah, the static-bugzilla not handling redirects thing is a bug I meant to file weeks ago [21:33:00] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1389690 (10Eevans) >>! 
In T78514#1389610, @GWicke wrote: > @eevans, https://github.com/dropwizard/metrics/blob/v2.2.0/metrics... [21:33:11] but could not be bothered because I'd just be seen as another person making it difficult to get rid of bugzilla [21:33:12] meh [21:34:08] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1389698 (10fgiunchedi) also FWIW metrics has been upgraded to 3.1.0 in https://github.com/apache/cassandra/commit/8896a70b015... [21:35:02] 6operations, 10Wikimedia-Bugzilla: old-bugzilla redirects broken - https://phabricator.wikimedia.org/T103425#1389710 (10Legoktm) [21:35:05] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1389709 (10Legoktm) [21:35:23] 6operations, 10Wikimedia-Bugzilla, 5Patch-For-Review: redirect old-bugzilla to static-bugzilla - https://phabricator.wikimedia.org/T103190#1389716 (10Legoktm) Not working properly, see {T103425}. [21:41:36] !log restarting Cassandra on restbase1001 to get the metrics back [21:41:40] Logged the message, Master [21:44:10] 6operations, 6Analytics-Engineering: Varnish caching around datasets.wikimedia.org is causing breakages - https://phabricator.wikimedia.org/T103423#1389771 (10Ottomata) [21:45:44] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1389783 (10GWicke) >>! In T78514#1389690, @Eevans wrote: > Without the test for being connected / already closed, it might ex... 
[21:50:10] !log trebuchet fetch for scap/scap failed on mw2086.codfw.wmnet, mw1222.eqiad.wmnet and virt1000.wikimedia.org [21:50:14] Logged the message, Master [21:51:00] virt1000 probably needs to be removed from the trebuchet targets list in redis [21:52:02] (03CR) 10Aaron Schulz: [C: 031] Enable TinyRGB ICC profile swapping on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219867 (https://phabricator.wikimedia.org/T100976) (owner: 10Gilles) [21:53:06] 6operations, 10ops-eqiad: graphite1002 slot 7 disk failed - https://phabricator.wikimedia.org/T103159#1389817 (10fgiunchedi) p:5Triage>3Normal a:3Cmjohnson [21:55:43] !log trebuchet checkout for scap/scap failed on 23 hosts: mw1104, mw1222, mw2009, mw2011, mw2021, mw2028, mw2031, mw2034, mw2069, mw2076, mw2080, mw2086, mw2095, mw2099, mw2120, mw2127, mw2131, mw2136, mw2170, mw2187, mw2189, mw2197, virt1000 [21:55:47] Logged the message, Master [21:56:15] !log updated scap to 81b7c14 (Move dsh group file names to config) [21:56:19] Logged the message, Master [21:57:48] 6operations, 10Wikimedia-Bugzilla: old-bugzilla redirects broken - https://phabricator.wikimedia.org/T103425#1389847 (10Aklapper) @legoktm: I don't really see why that redirect should be supported or why this is high priority. Who explicitly links to old-bz URLs and why? [21:58:54] 6operations, 10ops-eqiad, 10RESTBase: investigate restbase1007 sdb failure - https://phabricator.wikimedia.org/T102557#1389850 (10fgiunchedi) sadly I'm seeing failures for `sda` on `restbase1008` too, @cmjohnson perhaps a cable check might be in order, I don't think the swap will tell us much at this point... [21:59:28] !log reboot restbase1008 [21:59:32] Logged the message, Master [22:04:11] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1389878 (10Eevans) >>! In T78514#1389783, @GWicke wrote: >>>! 
In T78514#1389690, @Eevans wrote: >> Without the test for being... [22:04:21] 6operations, 10Wikimedia-Bugzilla: old-bugzilla redirects broken - https://phabricator.wikimedia.org/T103425#1389880 (10Legoktm) It should be supported because it used to work, and the expectation was that it would continue to work. Good urls don't break. As for who uses old-bugzilla, I have no idea because {... [22:08:15] !log Deployed patch for T103054 [22:08:19] Logged the message, Mr. Obvious [22:08:28] (03PS2) 10Filippo Giunchedi: racktables: increase default php memory limit [puppet] - 10https://gerrit.wikimedia.org/r/217724 (https://phabricator.wikimedia.org/T102092) [22:09:47] (03CR) 10Filippo Giunchedi: "{{done}} with file_line, what do you think?" [puppet] - 10https://gerrit.wikimedia.org/r/217724 (https://phabricator.wikimedia.org/T102092) (owner: 10Filippo Giunchedi) [22:12:06] (03CR) 10Filippo Giunchedi: "I think we should let upstart (or systemd) do their thing and log the daemon stdout/stderr to file, I'm not sure why this wasn't done howe" [puppet] - 10https://gerrit.wikimedia.org/r/218905 (owner: 10Matanya) [22:12:42] godog: so no action items here? ^ [22:14:29] matanya: maybe :) I'll ask why jobchron does its own file appending, if there isn't a particular reason then we should just let upstart do it IMO [22:14:43] thanks godog [22:16:15] jouncebot: next [22:16:15] In 0 hour(s) and 43 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150622T2300) [22:17:13] !log bd808 Synchronized README: Testing sync-file after scap update (duration: 00m 12s) [22:17:17] Logged the message, Master [22:19:28] !log scap error "@ERROR: access denied to common from localhost (127.0.0.1)" from mw2187 and mw2080 on sync-file test. 
[22:19:32] Logged the message, Master [22:21:23] (03PS3) 10Filippo Giunchedi: puppetmaster: wrapper script to access d-i over ssh [puppet] - 10https://gerrit.wikimedia.org/r/217016 [22:22:33] (03PS4) 10Filippo Giunchedi: puppetmaster: wrapper script to access d-i over ssh [puppet] - 10https://gerrit.wikimedia.org/r/217016 [22:22:39] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] puppetmaster: wrapper script to access d-i over ssh [puppet] - 10https://gerrit.wikimedia.org/r/217016 (owner: 10Filippo Giunchedi) [22:24:08] 6operations, 10ops-eqiad, 10RESTBase: investigate new restbase machine disks timeouts - https://phabricator.wikimedia.org/T102557#1390006 (10fgiunchedi) [22:24:51] !log restarting Cassandra on restbase1002 to get the metrics back [22:24:56] Logged the message, Master [22:25:17] ori: both those hosts that had weird scap problems have partial clones of /srv/deployment/scap/scap [22:25:35] the git status looks good but the scap.cfg file is missing [22:26:20] They were both in the set I logged earlier that failed the fetch stage for the new version [22:26:34] 6operations, 10ops-eqiad, 10RESTBase: investigate new restbase machine disks timeouts - https://phabricator.wikimedia.org/T102557#1390017 (10fgiunchedi) @cmjohnson, change of plans! we should try with spare big SSDs first and see if failures come back, please swap current SSDs on restbase1008 with two spare... [22:26:44] RoanKattouw: ^^ mystery solved (sort of) [22:27:31] OK [22:27:37] So should I repeat my sync? [22:27:50] Presumably just resyncing those machines (once they're fixed) isn't enough because they're scap proxies [22:27:53] the only hosts that would be effected are in codfw [22:28:09] so we are ok right now I thik [22:28:11] *think [22:28:12] OK [22:28:18] Yeah there'll be a scap tomorrow anyway [22:29:00] Can some root run `salt-call deploy.checkout 'scap/scap'` on mw2080.codfw.wmnet to see if there is an actionable error message? 
[22:30:27] 7Blocked-on-Operations, 6Discovery, 6Labs, 10Maps, and 2 others: Upgrade postgres on labsdb1004 / 1005 to 9.4, and PostGis 2.1 - https://phabricator.wikimedia.org/T101233#1390024 (10RobH) 5Open>3declined a:3RobH I'm going to go ahead and decline this outright (rather than resolve as @Yurik suggests),... [22:30:27] PROBLEM - puppet last run on rhodium is CRITICAL puppet fail [22:31:06] bd808: doing [22:31:08] bd808: running [22:31:09] ah [22:31:14] chasemp: go for it [22:31:17] i'll look at the upstart thing :) [22:31:18] done [22:31:26] Command u'/usr/bin/git checkout --force --quiet tags/scap/scap-sync-20150622-214637' failed with return code: 128 [22:31:26] [ERROR ] output: error: object file .git/objects/57/a4ca75e191a778f80d06bf1bc331d7db5d7f6a is empty [22:31:26] error: object file .git/objects/57/a4ca75e191a778f80d06bf1bc331d7db5d7f6a is empty [22:31:28] fatal: loose object 57a4ca75e191a778f80d06bf1bc331d7db5d7f6a (stored in .git/objects/57/a4ca75e191a778f80d06bf1bc331d7db5d7f6a) is corrupt [22:31:30] local: [22:31:40] git clone corruption :/ [22:33:31] !log restarting Cassandra on restbase1003 to get the metrics back [22:33:35] Logged the message, Master [22:37:33] !log restarting Cassandra on restbase1004 to get the metrics back [22:37:37] Logged the message, Master [22:38:08] PROBLEM - puppet last run on strontium is CRITICAL puppet fail [22:39:39] PROBLEM - puppet last run on labcontrol1001 is CRITICAL puppet fail [22:40:28] 6operations, 10ops-eqiad, 10RESTBase: investigate new restbase machine disks timeouts - https://phabricator.wikimedia.org/T102557#1390055 (10RobH) So we cannot send these back to our vendor (HP) for coverage. They have to go back to Samsung, via the normal process anyone wants to send back to them, via thei... 
[22:45:25] !log restarting Cassandra on restbase1005 to get the metrics back [22:45:29] Logged the message, Master [22:51:05] !log ori Synchronized php-1.26wmf10/resources/src/mediawiki/mediawiki.Title.js: I0e5f2d3b2: Fix undeclared dependency on jquery.mwExtension (duration: 00m 12s) [22:51:10] Logged the message, Master [22:51:44] 6operations, 10RESTBase, 10RESTBase-Cassandra: begin testing Cassandra 2.1.6 - https://phabricator.wikimedia.org/T101745#1390086 (10GWicke) Sadly, the metrics all died over the weekend. Restarting the instances brings them back, but clearly their half-life is drastically reduced to 2.1.3. Otherwise 2.1.7 is... [22:55:51] bblack: Hey could you purge a URL on cp1044 (misc-web Varnish) that has stale content? [22:55:52] I'm seeing https://gist.github.com/catrope/016fd02d5bc2ee7b4c35 [22:56:21] even with -H "Cache-Control: no-cache" , but I guess that's ignored of course [22:58:02] should that file's URLs be date-versioned or something? I'm guessing there's no existing method to purge them on updates [22:58:11] (Thing to look for in that gist: different Last-Modified timestamps and different Content-Length values indicating that cp1044's version is stale) [22:58:14] (or not send cacheable headers, I donno) [22:58:34] Hmm [22:58:37] I guess what I mean is, I think varnish is doing what it was told to do here [22:58:50] I don't want to go manually purge every time someone changes a file somewhere [22:58:56] I wonder if the web server is where that CC header comes from [22:59:00] I'll do it, I'm just saying, this needs a better process [22:59:10] Well, hold on, there's no s-maxage header [22:59:13] Did Varnish eat it? [22:59:19] Or does it default to max-age? [22:59:28] yes, varnish likely rewrote it or messed with it [22:59:38] but what does the origin send with those URLs? [22:59:48] That's a good question [22:59:51] About to investigate that [23:00:04] RoanKattouw, ^d, gilles, jdlrobson, rmoen: Dear anthropoid, the time has come. 
Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150622T2300). [23:00:32] Alright, SWAT time [23:00:42] gilles, jdlrobson: Please confirm you guys are here [23:00:56] RoanKattouw: I have a pulse [23:01:41] bblack: Yeah the CC header is what the Apache sent, unmolested [23:01:50] it's purged now FWIW [23:01:51] \o [23:02:01] Varnish disrespects it though [23:02:30] There is no s-maxage (although maybe it's allowed to use max-age in that situation, I don't know) and there's must-revalidate which it doesn't appear to do [23:02:45] RoanKattouw: i am indeed around. Note one of the cherry picked patches is having jenkins problems because of https://phabricator.wikimedia.org/T102674 [23:02:48] I don't think must-revalidate means what you think it means in this case [23:02:55] it's a false positive though [23:02:58] s-maxage vs maxage, maybe [23:03:00] I mean, between client and server I know what it means [23:03:05] Between proxy and server, I don't know [23:03:18] must-revalidate only applies to stale content [23:03:24] in other words, after the initial 86400 expires [23:03:24] Oooh [23:03:29] Hah [23:03:47] I guess the reason I didn't know that is because the situation in which I care has max-age=0 [23:03:54] :) [23:03:57] (Wikipedia page content) [23:04:16] yeah it's not often people actually use real ages + must-revalidate and mean what they say [23:05:08] PROBLEM - Cassanda CQL query interface on restbase1007 is CRITICAL: Connection refused [23:05:08] PROBLEM - Cassandra database on restbase1007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [23:05:18] PROBLEM - RAID on restbase1007 is CRITICAL Active: 5, Working: 5, Failed: 1, Spare: 0 [23:06:13] 6operations, 10Analytics, 6Discovery, 10MediaWiki-General-or-Unknown, and 5 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1390116 (10GWicke) [23:06:44] poor 
1007 [23:11:59] Alright, let's SWAT [23:12:08] (03CR) 10Catrope: [C: 032] Enable TinyRGB ICC profile swapping on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219867 (https://phabricator.wikimedia.org/T100976) (owner: 10Gilles) [23:12:09] :) [23:12:14] * greg-g is about to leave for the bus [23:12:18] (03Merged) 10jenkins-bot: Enable TinyRGB ICC profile swapping on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219867 (https://phabricator.wikimedia.org/T100976) (owner: 10Gilles) [23:12:21] alright! [23:12:54] !log catrope Synchronized wmf-config/InitialiseSettings.php: Enable TinyRGB ICC profile swapping on testwiki (duration: 00m 13s) [23:12:58] Logged the message, Master [23:13:07] gilles: ---^^ [23:13:28] RoanKattouw: change should be a no-op at the point, testwiki needs the new code for the config value to be used [23:13:46] not worth backporting [23:13:59] 6operations, 6Release-Engineering, 10Wikidata, 10Wikimedia-General-or-Unknown: Wikidata and Wikiversity logo 404ing on wikimedia.org - https://phabricator.wikimedia.org/T103296#1390148 (10Addshore) Well... https://meta.wikimedia.org/w/index.php?title=Www.wikimedia.org_template&diff=next&oldid=12369391 The... [23:14:03] I'll keep an eye on testwiki during the train deploy, but the change poses very little risk [23:14:11] That code rolls out tomorrow? [23:14:22] yes [23:14:39] bd808: Still getting those same scap erros [23:17:13] !log catrope Synchronized php-1.26wmf10/extensions/MobileFrontend/: SWAT (duration: 00m 15s) [23:17:17] Logged the message, Master [23:17:41] !log catrope Synchronized php-1.26wmf10/extensions/Gather: SWAT (duration: 00m 12s) [23:17:45] Logged the message, Master [23:23:09] jdlrobson, rmoen: ---^^ [23:23:21] RoanKattouw: sweet. [23:23:47] RoanKattouw: thanks. Checking that out [23:24:01] RoanKattouw: so both should be live on enwiki ? [23:25:45] RoanKattouw: not seeing them yet... 
[23:26:14] jdlrobson: no settings or collections link for me either [23:26:26] Oh, crap [23:26:29] I'm an idiot [23:26:32] git submodule update :D [23:27:09] !log catrope Synchronized php-1.26wmf10/extensions/Gather: For real this time (duration: 00m 13s) [23:27:13] Logged the message, Master [23:27:34] !log catrope Synchronized php-1.26wmf10/extensions/MobileFrontend: For real this time (duration: 00m 14s) [23:27:38] Logged the message, Master [23:28:13] Nice [23:28:34] RoanKattouw, jdlrobson: I can confirm a settings menu item :) [23:28:42] (03CR) 10Mobrovac: [C: 04-1] "I wonder how the jenkins tests passed, since this obviously needs to depend on https://gerrit.wikimedia.org/r/#/c/219177/ , so -1 because " [puppet] - 10https://gerrit.wikimedia.org/r/219811 (https://phabricator.wikimedia.org/T98851) (owner: 10Muehlenhoff) [23:30:00] yeh that's working.. not seeing the other fix though... [23:30:18] jdlrobson: I'm seeing view-border-box on body. That correct ? [23:30:28] oh you are? good... cos i'm not :/ [23:30:39] also normal sized log out icon [23:30:44] how strange [23:30:59] * rmoen does hard refresh [23:31:38] rmoen seeing it? I think if you are seeing it that's good enough.. but i'm not seeing it [23:31:45] Yeah [23:32:01] jdlrobson, RoanKattouw fixes confirmed on my end [23:32:11] ok good enough for me :) thanks RoanKattouw ! :) [23:32:26] indeed thanks RoanKattouw Now i'm going to read about proxycommand [23:33:48] 7Blocked-on-Operations, 6Discovery, 6Labs, 10Maps, and 2 others: Upgrade postgres on labsdb1004 / 1005 to 9.4, and PostGis 2.1 - https://phabricator.wikimedia.org/T101233#1390198 (10yuvipanda) 5declined>3Open It should still happen at some point - yurik isn't the only user of the machine and others wil... 
[23:34:32] bblack: Re that Varnish caching thing, I found that the web server had recently (~1 week ago) been reconfigured to send those CC headers https://gerrit.wikimedia.org/r/#/c/218534/2/modules/statistics/files/datasets.wikimedia.org [23:34:42] bblack: So what I'm seeing is "expected" behavior [23:35:04] Which means you don't need to worry about anything and different people need to be complained to instead :) [23:36:07] (03PS1) 10Filippo Giunchedi: puppetmaster: split frontend scripts [puppet] - 10https://gerrit.wikimedia.org/r/220023 [23:36:36] RoanKattouw: I did purge back when we talked earlier, FWIW [23:36:46] paravoid: ^ I broke the puppet run on backends adding new_install key dependency [23:36:46] Yeah thanks for that [23:36:50] I mostly brought up the rest because I don't want this to be a recurring request :) [23:36:58] I'll complain on the bug [23:37:19] Because they're setting max-age=86400 when serving a directory whose contents are rsynced in by a cron job that runs every 30 minutes [23:38:00] :) [23:38:23] http://martinfowler.com/bliki/TwoHardThings.html [23:38:34] this is like the 5th time I've gotten to quote that in the past week [23:38:37] it's awesome [23:41:47] RoanKattouw: there are corrupt git clones of /srv/deployment/scap/scap on several hosts in the cluster apparently. Trebuchet doesn't have a "--force" flag or anything I can use to clean them up so this needs root intervention [23:42:16] RoanKattouw: I !logged the list of hosts earlier. I'll open a phab task about it [23:42:33] OK [23:42:49] !log restarted Cassandra on restbase1006 [23:42:54] Logged the message, Master [23:44:12] bblack: Hmm, do we not have hash-based FE->BE routing on the misc-web varnishes? [23:44:34] there is no BE on misc-web I don't think [23:44:39] just one layer [23:44:41] OK [23:44:53] If we had the two-layer system, a split-brain like what I found shouldn't happen, right? 
[23:44:55] but I need to double-check that, I'm pretty useless right now [23:45:12] 6operations, 7discovery-system: Ensure alerts and notifications on confd failure modes - https://phabricator.wikimedia.org/T103360#1390242 (10chasemp) >>! In T103360#1388653, @chasemp wrote: > [] **Confd service is not starting / running** > > Proposal: fix upstart configuration and fix systemd as well (secon... [23:45:27] yes misc is 1layer, but that has nothing to do with the split-brain thing [23:45:43] either way it has two separate machines as frontends, and it could be cache differently on each within the rules [23:45:53] I mean I know that Apache was sending cache headers with an expiry that's 48x too high [23:46:04] Ooh, right [23:46:16] The frontends could still hit the backend at different times [23:47:07] 6operations, 10Parsoid, 6Services, 10service-template-node, 7service-runner: Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#1390255 (10mobrovac) [23:47:52] RoanKattouw: right, especially if it's not a popular URL. probably the one with the old content was one someone actually happened to hit before, and the one with the new content never hit that URL before you asked after the change. 
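A minimal sketch of the caching behavior discussed above, under stated assumptions: the `Frontend` class, the `origin` function, and the names "fe-a"/"fe-b" are hypothetical and only model two independent single-layer caches; this is not how Varnish is implemented. It uses the two numbers from the conversation (max-age=86400 and a 30-minute rsync cron) to show where the "48x" figure comes from and why two frontends can legitimately serve different versions of the same URL. Since must-revalidate only applies once content is stale, a fresh hit in this model never contacts the origin.

```python
MAX_AGE = 86400           # seconds, the max-age value quoted in the discussion
RSYNC_INTERVAL = 30 * 60  # seconds, the cron interval quoted in the discussion


def staleness_ratio(max_age, update_interval):
    """How many update cycles fit inside one cache lifetime."""
    return max_age / update_interval


def origin(now):
    """Hypothetical origin: content changes at every rsync run."""
    return "version-%d" % (now // RSYNC_INTERVAL)


class Frontend:
    """Hypothetical single-layer cache; each instance keeps its own store."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # url -> (body, fetched_at)

    def get(self, url, now, max_age=MAX_AGE):
        cached = self.store.get(url)
        if cached is not None and now - cached[1] < max_age:
            # Still fresh: served from cache, origin is not contacted,
            # so must-revalidate does not come into play yet.
            return cached[0]
        # Missing or stale: fetch from the origin and remember when.
        body = origin(now)
        self.store[url] = (body, now)
        return body


fe_a = Frontend("fe-a")
fe_b = Frontend("fe-b")

first_a = fe_a.get("/file", 0)        # fe-a caches the object at t=0
later_a = fe_a.get("/file", 7200)     # two hours later: still the cached copy
later_b = fe_b.get("/file", 7200)     # fe-b never saw the URL, fetches now
expired_a = fe_a.get("/file", 86400)  # lifetime over: fe-a refetches

print(staleness_ratio(MAX_AGE, RSYNC_INTERVAL))
print(first_a, later_a, later_b, expired_a)
```

In this model the two frontends disagree for up to a day after an update, which matches the explanation that the frontend with old content had been hit before the change and the other had not.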
[23:50:53] 6operations, 10Deployment-Systems: Corrupt /srv/deployment/scap/scap checkouts on WMF prod cluster - https://phabricator.wikimedia.org/T103441#1390287 (10bd808) 3NEW [23:51:05] 6operations, 10Deployment-Systems: Corrupt /srv/deployment/scap/scap checkouts on WMF prod cluster - https://phabricator.wikimedia.org/T103441#1390294 (10bd808) p:5Triage>3High [23:51:26] RoanKattouw: ^ [23:52:10] (03PS1) 10Rush: fix issues with pybal-eval-check.py [puppet] - 10https://gerrit.wikimedia.org/r/220024 [23:53:00] Thanks [23:53:42] (03PS2) 10Rush: fix issues with pybal-eval-check.py [puppet] - 10https://gerrit.wikimedia.org/r/220024 [23:53:50] I'll take a look [23:55:00] 6operations, 10Deployment-Systems, 6Release-Engineering: Corrupt /srv/deployment/scap/scap checkouts on WMF prod cluster - https://phabricator.wikimedia.org/T103441#1390295 (10chasemp) [23:55:08] godog: thanks [23:55:32] godog: are you looking at T103441? [23:55:52] (03CR) 10Rush: [C: 032] fix issues with pybal-eval-check.py [puppet] - 10https://gerrit.wikimedia.org/r/220024 (owner: 10Rush) [23:56:12] chasemp: yeah, I don't want to overlap though, any thoughts on it? [23:56:24] I was just going to do waht bd808 suggested :) [23:56:35] i really don't know trebuchet / scap behavior well [23:56:59] but if there is some ongoing failure that puts us back where we are now [23:57:03] idk [23:58:06] heh not sure either, is this the first time that git checkouts come up corrupt? [23:58:17] bd808: ^ [23:58:28] it it possible that the clones got corrupted when the ipv6 interface was added to tin and it started denying access [23:58:55] This is the first time I've seen that many of them broken [23:59:10] but it has happened occasionally in the past [23:59:38] mh on mw1222 for example git was choking on -rw-r--r-- 1 root root 0 Feb 17 16:09 /srv/deployment/scap/scap/.git/objects/ff/ff40f21fd3a735ca0b54fa035ccc4251247286
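A minimal sketch for spotting the symptom described above (zero-byte loose object files under `.git/objects`), assuming the standard loose-object layout of two-hex-character fan-out directories. The function name `find_empty_loose_objects` is hypothetical and not part of scap or Trebuchet. It only reports candidates and repairs nothing; `git fsck` is the tool for a full integrity check.

```python
import os

HEX_DIGITS = set("0123456789abcdef")


def find_empty_loose_objects(git_dir):
    """Return ids of loose objects in git_dir whose files are zero bytes long."""
    objects_dir = os.path.join(git_dir, "objects")
    empty = []
    if not os.path.isdir(objects_dir):
        return empty
    for fanout in sorted(os.listdir(objects_dir)):
        # Loose objects live in two-hex-character directories;
        # this skips entries such as "pack" and "info".
        if len(fanout) != 2 or not set(fanout) <= HEX_DIGITS:
            continue
        fanout_path = os.path.join(objects_dir, fanout)
        if not os.path.isdir(fanout_path):
            continue
        for name in sorted(os.listdir(fanout_path)):
            path = os.path.join(fanout_path, name)
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                # Object id = directory name + file name.
                empty.append(fanout + name)
    return empty


if __name__ == "__main__":
    import sys

    # Usage: python find_empty.py /path/to/repo/.git
    for object_id in find_empty_loose_objects(sys.argv[1]):
        print(object_id)
```

Run against the `.git` directory of a checkout, this prints one object id per empty file, which could help compare hosts before deciding whether to re-clone.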