[00:06:55] !log mwscript deleteEqualMessages.php --wiki metawiki --delete [00:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:21:39] !log mwscript deleteEqualMessages.php --wiki ruwiki (T45917) [00:21:40] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [00:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:25:13] (CR) Luke081515: [C: 1] Permission change at test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/281318 (https://phabricator.wikimedia.org/T131037) (owner: Urbanecm) [00:28:25] (PS1) Krinkle: wmfstatic: Change message "Invalid path type" to "Invalid file type" [mediawiki-config] - https://gerrit.wikimedia.org/r/281383 [00:28:38] (CR) Krinkle: [C: 2] wmfstatic: Change message "Invalid path type" to "Invalid file type" [mediawiki-config] - https://gerrit.wikimedia.org/r/281383 (owner: Krinkle) [00:29:04] (Merged) jenkins-bot: wmfstatic: Change message "Invalid path type" to "Invalid file type" [mediawiki-config] - https://gerrit.wikimedia.org/r/281383 (owner: Krinkle) [00:29:12] (CR) Krinkle: [C: 2] Permission change at test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/281318 (https://phabricator.wikimedia.org/T131037) (owner: Urbanecm) [00:29:37] (Merged) jenkins-bot: Permission change at test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/281318 (https://phabricator.wikimedia.org/T131037) (owner: Urbanecm) [00:30:15] !log krinkle@tin Synchronized w/static.php: (no message) (duration: 00m 35s) [00:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:45:27] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). 
[00:46:38] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [00:47:32] Krinkle: Did you sync the right file at https://gerrit.wikimedia.org/r/281318? Seems like it is currently not live at test2 [00:47:44] Luke|away: I didn't sync yet [00:47:51] ah, ok [00:48:08] I just wondered, because the task is closed, but the change is not live ;) [00:48:20] Syncing now [00:48:26] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [00:49:07] Luke|away: If you enable ChromeWikimediaDebug and set it to mw1017, you'll see the change live [00:49:13] Deploying to the main cluster now [00:49:17] !log krinkle@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 29s) [00:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:49:24] thanks [00:49:48] See https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug [00:50:08] Most of the time the canary server is the same as the rest. But during deployment it can be different (either next version, or some other temporary experiment) [00:50:46] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
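(Editor's sketch, not part of the channel log.) The exchange above describes checking a config change on the canary server by sending the X-Wikimedia-Debug header, which the ChromeWikimediaDebug extension sets for you. A rough Python equivalent might look like the following. The header name comes from the wikitech page linked above; the header value "1", the helper name build_debug_request, and the test2 URL are assumptions made for illustration, so check the wikitech page for the value format actually in use.

```python
from urllib.request import Request

# Header name as documented on the wikitech page referenced in the log.
DEBUG_HEADER = "X-Wikimedia-Debug"


def build_debug_request(url: str, header_value: str = "1") -> Request:
    """Build (but do not send) a request carrying the debug header.

    header_value is an assumption; the accepted format is described on
    the wikitech X-Wikimedia-Debug page and may differ.
    """
    return Request(url, headers={DEBUG_HEADER: header_value})


if __name__ == "__main__":
    # Hypothetical target page on test2wiki, used only as an example.
    req = build_debug_request("https://test2.wikipedia.org/wiki/Special:Version")
    # urllib normalises header names with str.capitalize() when storing them.
    print(req.full_url, req.header_items())
```

Passing the returned object to urllib.request.urlopen would send the request; the sketch stops short of that so it can be run without network access.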
[02:19:28] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [5000000.0] [02:25:08] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.19) (duration: 11m 45s) [02:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:33:39] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Apr 4 02:33:39 UTC 2016 (duration 8m 31s) [02:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:58:47] PROBLEM - HHVM rendering on mw1113 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.002 second response time [03:02:28] RECOVERY - HHVM rendering on mw1113 is OK: HTTP OK: HTTP/1.1 200 OK - 73081 bytes in 0.115 second response time [03:05:38] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [03:06:07] PROBLEM - Host iridium is DOWN: PING CRITICAL - Packet loss = 100% [03:09:37] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down! [03:09:47] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down! [03:10:16] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down! [03:11:29] YuviPanda, Phabricator is 503'ing for me. 
[03:15:12] matt_flaschen: 03:06 < icinga-wm> PROBLEM - Host iridium is DOWN: PING CRITICAL - Packet loss = 100% [03:15:17] so, yeah :/ [03:15:48] !log Phabricator is down [03:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:36:52] :/ [03:38:29] :( [03:40:00] do ops get paged when hosts go down? I don't know how icinga is set up for this sort of thing [03:47:57] PROBLEM - Disk space on restbase2004 is CRITICAL: DISK CRITICAL - free space: /srv 177759 MB (3% inode=99%) [04:02:12] depends on the service [04:08:38] it says it's powered off [04:11:44] http://paste.tstarling.com/p/qbrBRb.html [04:12:42] the system powered itself off? [04:12:46] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0] [04:12:46] apparently [04:12:51] I guess I will try turning it on [04:13:17] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0] [04:13:44] !log attempting to turn iridium back on via drac. 
"getraclog" says it powered itself off after resetting four times [04:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:16:00] I'm on the serial console now, it is booting [04:16:26] RECOVERY - Host iridium is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [04:16:50] well, phab is back up [04:18:47] !log iridium came back up, but mcelog reports high CPU temperature prior to the shutdown [04:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:19:57] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [04:22:06] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [04:22:27] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [04:23:16] PROBLEM - Disk space on restbase2004 is CRITICAL: DISK CRITICAL - free space: /srv 176743 MB (3% inode=99%) [04:30:17] PROBLEM - Disk space on restbase2004 is CRITICAL: DISK CRITICAL - free space: /srv 176537 MB (3% inode=99%) [04:52:27] 6Operations, 10ops-eqiad, 10Phabricator: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2176955 (10Peachey88) [04:58:57] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [05:02:07] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [05:19:17] PROBLEM - Disk space on restbase2004 is CRITICAL: DISK CRITICAL - free space: /srv 174398 MB (3% inode=99%) [05:31:43] 6Operations, 6Revision-Scoring-As-A-Service, 13Patch-For-Review: uwsgi takes a long time to restart (Debian Jessie in labs) - https://phabricator.wikimedia.org/T118495#2176972 (10Ladsgroup) [05:33:01] 6Operations, 6Revision-Scoring-As-A-Service, 13Patch-For-Review: uwsgi takes a long time to restart (Debian Jessie in labs) - 
https://phabricator.wikimedia.org/T118495#1801807 (10Ladsgroup) a:3Ladsgroup [05:37:10] (03PS2) 10Ladsgroup: ores: do git clone in staging [puppet] - 10https://gerrit.wikimedia.org/r/281228 [05:46:07] (03PS44) 10Ladsgroup: ores: Scap3 deployment configurations [puppet] - 10https://gerrit.wikimedia.org/r/280403 (https://phabricator.wikimedia.org/T130404) [06:34:17] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:07] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:54:03] !log installing apt bugfix updates [06:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:56:57] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:57:56] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:07] PROBLEM - Disk space on restbase2004 is CRITICAL: DISK CRITICAL - free space: /srv 175589 MB (3% inode=99%) [07:21:34] 6Operations, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, 6Services, and 2 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2177064 (10KartikMistry) [07:44:37] PROBLEM - puppet last run on db2069 is CRITICAL: CRITICAL: puppet fail [07:48:27] 6Operations: Integrate jessie 8.4 point release - https://phabricator.wikimedia.org/T131746#2177080 (10MoritzMuehlenhoff) [07:48:51] (03Abandoned) 10Reedy: Add Newsletter extension to beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281249 (https://phabricator.wikimedia.org/T127297) (owner: 1001tonythomas) [07:49:51] 6Operations, 10Incident-20150205-SiteOutage, 7Availability, 13Patch-For-Review: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#2177082 (10Joe) p:5High>3Normal a:5Joe>3None [07:55:53] (03CR) 10Mxn: [C: 031] "Sorry for the 
delay. Looks good." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280456 (https://phabricator.wikimedia.org/T130514) (owner: 10Thcipriani) [08:12:47] RECOVERY - puppet last run on db2069 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:25:33] 6Operations, 6Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2177093 (10ema) Mostly out of curiosity, I've checked which protocols are supported by other top-10 websites by looking at NPN responses: | google.com / youtube.com | h2, spdy/3.1, htt... [08:35:00] (03PS1) 10Urbanecm: Enable Translate extension on uawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) [08:38:54] 6Operations, 10Incident-20150205-SiteOutage, 7Availability, 13Patch-For-Review: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#2177112 (10Joe) I release this ticket as: # I don't have ideas/time to work on it # This is not happenin... [08:39:56] 6Operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#2177114 (10Joe) [08:39:58] 6Operations, 7Puppet, 10Salt, 13Patch-For-Review: Make it possible for wmf-reimage to work seamlessly with a non-local salt master - https://phabricator.wikimedia.org/T124761#2177113 (10Joe) 5Open>3Resolved [08:47:25] (03CR) 10Ema: [C: 032 V: 032] "These changes have been running in prod on the maps cluster for a few days without issues." [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/280612 (owner: 10BBlack) [08:49:53] 6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Refresh the appservers puppet code/configs - https://phabricator.wikimedia.org/T131748#2177126 (10Joe) [08:50:04] (03CR) 10Ema: [C: 032 V: 032] Varnish 4 API porting. 
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/280198 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [08:50:27] (03CR) 10Ema: [C: 032 V: 032] Remove loglines cache to mitigate a possible memory leak. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [08:50:46] (03CR) 10Ema: [C: 032 V: 032] Code cleanup [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/280294 (owner: 10BBlack) [08:50:46] !log installing gnupg updates [08:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:51:02] (03CR) 10Ema: [C: 032 V: 032] Remove format.key feature [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/280295 (owner: 10BBlack) [08:51:17] (03CR) 10Ema: [C: 032 V: 032] remove a couple of inline attrs [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/280377 (owner: 10BBlack) [08:51:24] 6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Refresh the appservers puppet code/configs - https://phabricator.wikimedia.org/T131748#2177141 (10Joe) [08:51:33] (03CR) 10Ema: [C: 032 V: 032] split lp->match allocation from lp itself [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/280378 (owner: 10BBlack) [08:51:49] (03CR) 10Ema: [C: 032 V: 032] remove lp->tmpbuf, bump scratch default to 4MB [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/280379 (owner: 10BBlack) [08:52:10] (03CR) 10Ema: [C: 032 V: 032] refactor match_assign/scratch/parser stuff [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/280380 (owner: 10BBlack) [08:52:32] (03CR) 10Ema: [C: 032 V: 032] minor cleanups from cppcheck [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/280381 (owner: 10BBlack) [08:56:55] 6Operations, 7Puppet, 10Wikimedia-Apache-configuration: Refactor the mediawiki puppet classes to make HHVM default, drop zend compatibility - 
https://phabricator.wikimedia.org/T126310#2177147 (10Joe) [08:56:57] 6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Refresh the appservers puppet code/configs - https://phabricator.wikimedia.org/T131748#2177146 (10Joe) [08:57:30] 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Reduce the number of appservers we're using in eqiad - https://phabricator.wikimedia.org/T126242#2177149 (10Joe) [08:57:32] 6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Refresh the appservers puppet code/configs - https://phabricator.wikimedia.org/T131748#2177126 (10Joe) [09:00:19] (03PS6) 10Volans: DB: Expose Puppet SSL certs and generate CA cert [puppet] - 10https://gerrit.wikimedia.org/r/279596 (https://phabricator.wikimedia.org/T111654) [09:00:46] 6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2177150 (10Joe) [09:02:14] 6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Refresh the appservers puppet code/configs - https://phabricator.wikimedia.org/T131748#2177126 (10Joe) [09:02:16] 6Operations, 7HHVM: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2177164 (10Joe) [09:07:44] 6Operations, 7Puppet, 10MediaWiki-General-or-Unknown: Profile and reduce the puppet execution time on the appservers - https://phabricator.wikimedia.org/T131750#2177166 (10Joe) [09:07:51] 6Operations, 7Puppet, 10MediaWiki-General-or-Unknown: Profile and reduce the puppet execution time on the appservers - https://phabricator.wikimedia.org/T131750#2177166 (10Joe) p:5Triage>3Normal [09:08:01] 6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Refresh the appservers puppet code/configs - https://phabricator.wikimedia.org/T131748#2177126 (10Joe) p:5Triage>3Normal [09:10:23] !log Disabling Puppet on cluster mysql and parsercache to merge and test change 279596 on db2040, T111654 [09:10:24] T111654: Set up TLS for MariaDB 
replication - https://phabricator.wikimedia.org/T111654 [09:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:13:51] (03CR) 10Volans: [C: 032] DB: Expose Puppet SSL certs and generate CA cert [puppet] - 10https://gerrit.wikimedia.org/r/279596 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [09:23:39] 6Operations, 10Salt: take steps outlined at techops offiste to (try to) address salt reliability - https://phabricator.wikimedia.org/T115292#2177187 (10ArielGlenn) [09:23:41] 7Blocked-on-Operations, 6Operations, 10Parsoid, 10Salt, 6Scrum-of-Scrums: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#2177188 (10ArielGlenn) [09:23:43] 6Operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#2177185 (10ArielGlenn) 5Open>3Resolved And it's done, even with the blocking task listed about parsoid. They have a workaround they have been using for months. Thanks Joe for fixing up wmf... [09:24:04] (03PS4) 10Gehel: Don't create new log files for cirrus-suggest with logrotate [puppet] - 10https://gerrit.wikimedia.org/r/268215 (owner: 10EBernhardson) [09:25:42] (03CR) 10Gehel: "I rebased that change as half of it was already on the production branch. 
It now becomes a simple one line change (not that it was overly " [puppet] - 10https://gerrit.wikimedia.org/r/268215 (owner: 10EBernhardson) [09:26:51] !log Re-enabling Puppet on cluster mysql and parsercache to deploy change 279596, T111654 [09:26:52] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [09:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:33:18] !log deploy restbase ba39d2bcd2f5 to restbase2004 before repooling [09:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:37:23] !log repool restbase2004 [09:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:39:22] (03PS1) 10Giuseppe Lavagetto: mediawiki: make base class trusty and forward only [puppet] - 10https://gerrit.wikimedia.org/r/281407 (https://phabricator.wikimedia.org/T126310) [09:39:24] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: drop precise compatibility [puppet] - 10https://gerrit.wikimedia.org/r/281408 (https://phabricator.wikimedia.org/T126310) [09:39:26] (03PS1) 10Giuseppe Lavagetto: mediawiki::packages: drop precise compatibility [puppet] - 10https://gerrit.wikimedia.org/r/281409 (https://phabricator.wikimedia.org/T126310) [09:39:30] (03PS1) 10Giuseppe Lavagetto: mediawiki::web: drop precise support [puppet] - 10https://gerrit.wikimedia.org/r/281410 (https://phabricator.wikimedia.org/T126310) [09:39:33] (03PS1) 10Giuseppe Lavagetto: mediawiki::web: drop compatibility with precise, mod_php [puppet] - 10https://gerrit.wikimedia.org/r/281411 (https://phabricator.wikimedia.org/T126310) [09:39:35] (03PS1) 10Giuseppe Lavagetto: mediawiki::jobrunner: drop precise compatibility [puppet] - 10https://gerrit.wikimedia.org/r/281412 (https://phabricator.wikimedia.org/T126310) [09:39:52] * _joe_ spring cleaning [09:41:41] !log depool restbase2003 before raid expansion [09:41:43] (03PS1) 10Elukey: Add info to the varnishkafka README 
after the Varnish 4 porting. [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/281413 (https://phabricator.wikimedia.org/T124278) [09:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:53:03] !log de-pooled aqs1001.eqiad.wmnet as pre-step for nodejs upgrade [09:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:53:39] elukey: never mind :) [09:54:40] !log nginx rolling restart for openssl upgrade: cp1046, cp1052, cp1068, cp1071, cp1099 [09:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:57:26] Blocked-on-Operations, Operations, RESTBase-Cassandra: expand raid0 in restbase200[1-6] - https://phabricator.wikimedia.org/T127951#2177214 (fgiunchedi) @eevans, yes, restbase2003 is expanding its raid0 ATM [09:57:31] !log start expanding raid0 on restbase2003 [09:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:58:53] Operations, Gerrit, Mail, Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2177215 (hashar) p:High>Normal [09:59:47] !log installing pcre3 updates [09:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:00:04] Operations, Gerrit, Mail, Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2158668 (hashar) Open>stalled I have updated the task detail to: * point to @akosiaris comment explaining how to resume the queue processing. * Mention 2.8.4 will fix it This... 
[10:01:35] (03PS1) 10ArielGlenn: add scap config file for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281416 [10:02:08] (03PS2) 10ArielGlenn: set up for addition of scap config file on deployment servers for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281416 [10:03:08] (03CR) 10jenkins-bot: [V: 04-1] set up for addition of scap config file on deployment servers for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281416 (owner: 10ArielGlenn) [10:07:38] (03CR) 10Hashar: "I got rid of grunt-cli local install on all slaves a few minutes ago." [puppet] - 10https://gerrit.wikimedia.org/r/280974 (https://phabricator.wikimedia.org/T124474) (owner: 10Hashar) [10:08:47] PROBLEM - Disk space on restbase2004 is CRITICAL: DISK CRITICAL - free space: /srv 177995 MB (3% inode=99%) [10:11:29] taking a look ^ [10:13:26] (03PS2) 10Urbanecm: Enable Translate extension on uawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) [10:14:38] (03PS3) 10ArielGlenn: set up for addition of scap config file on deployment servers for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281416 [10:15:52] (03CR) 10jenkins-bot: [V: 04-1] set up for addition of scap config file on deployment servers for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281416 (owner: 10ArielGlenn) [10:17:14] (03PS4) 10ArielGlenn: set up for addition of scap config file on deployment servers for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281416 [10:18:38] (03CR) 10jenkins-bot: [V: 04-1] set up for addition of scap config file on deployment servers for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281416 (owner: 10ArielGlenn) [10:19:26] RECOVERY - Disk space on restbase2004 is OK: DISK OK [10:19:49] (03PS5) 10ArielGlenn: set up for addition of scap config file on deployment servers for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281416 [10:20:18] (03PS2) 10Elukey: Add info to the varnishkafka README after the Varnish 4 porting. 
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/281413 (https://phabricator.wikimedia.org/T124278) [10:21:14] (03CR) 10jenkins-bot: [V: 04-1] set up for addition of scap config file on deployment servers for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281416 (owner: 10ArielGlenn) [10:22:48] (03PS6) 10ArielGlenn: set up for addition of scap config file on deployment servers for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281416 [10:23:01] !log reduce reserved blocks for /srv on restbase2004 [10:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:23:49] (03CR) 10jenkins-bot: [V: 04-1] set up for addition of scap config file on deployment servers for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281416 (owner: 10ArielGlenn) [10:26:04] (03PS7) 10ArielGlenn: set up for addition of scap config file on deployment servers for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281416 [10:26:07] mondays [10:26:08] hate em [10:28:39] (03CR) 10ArielGlenn: [C: 032] set up for addition of scap config file on deployment servers for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281416 (owner: 10ArielGlenn) [10:35:59] (03PS1) 10Giuseppe Lavagetto: mediawiki::web: drop the HHVM define and mod_php [puppet] - 10https://gerrit.wikimedia.org/r/281418 (https://phabricator.wikimedia.org/T126310) [10:37:18] <_joe_> anyone up for reviewing ^^? [10:37:19] <_joe_> :P [10:38:10] (03CR) 10Alex Monk: "does not match existing code style" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm) [10:38:21] (03CR) 10Elukey: [C: 032] Add info to the varnishkafka README after the Varnish 4 porting. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/281413 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [10:38:31] (03CR) 10Elukey: [V: 032] Add info to the varnishkafka README after the Varnish 4 porting. 
[software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/281413 (https://phabricator.wikimedia.org/T124278) (owner: Elukey) [10:42:26] !log re-pooled aqs1001.eqiad (no node upgrade, need more info about restbase) [10:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:47:10] (PS1) Giuseppe Lavagetto: mediawiki::web: drop apache 2.2 support [puppet] - https://gerrit.wikimedia.org/r/281419 (https://phabricator.wikimedia.org/T126310) [10:48:22] Operations, Traffic, Patch-For-Review, Varnish: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#2177315 (elukey) [10:48:24] Operations, Analytics-Kanban, Traffic, Patch-For-Review: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2177313 (elukey) Open>Resolved Code merged by ema, plus the varnish maps cluster has been running with vk for days without triggering any... [10:48:58] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:49:26] PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: Puppet has 1 failures [10:54:18] RECOVERY - DPKG on labmon1001 is OK: All packages OK [11:08:25] apergos: hi. r: https://phabricator.wikimedia.org/T127793 script for dump is available in ContentTranslation and default parameters are fine as of now. [11:08:36] apergos: let me know any more info we need. [11:08:40] thank you. [11:08:50] nag me in two days please if you haven't seen anything on the ticket. [11:09:13] apergos: we can 'split-at' if dump goes larger. [11:09:22] apergos: sure. thanks. [11:09:55] apergos: just need to finalize frequency. we can start with weekly, wait for dump size and decide. 
[11:15:41] (03PS3) 10Urbanecm: Enable Translate extension on uawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) [11:16:27] RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [11:18:01] (03CR) 10Urbanecm: "@Alex: What code style do you mean? I tried to do my best. If you meant the indent, I fixed it in patch 3. Sorry for it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm) [11:21:02] (03CR) 10Alex Monk: "The rest of the file has spaces around the parameters to array() and spaces after the //. These additions don't" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm) [11:29:23] (03PS2) 10Filippo Giunchedi: graphite: add 'big_users' route and cluster [puppet] - 10https://gerrit.wikimedia.org/r/277490 (https://phabricator.wikimedia.org/T85451) [11:30:16] (03PS4) 10Urbanecm: Enable Translate extension on uawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) [11:30:31] (03CR) 10jenkins-bot: [V: 04-1] graphite: add 'big_users' route and cluster [puppet] - 10https://gerrit.wikimedia.org/r/277490 (https://phabricator.wikimedia.org/T85451) (owner: 10Filippo Giunchedi) [11:30:47] (03CR) 10Urbanecm: "I fixed it. Is it ok?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm) [12:05:53] 6Operations, 7Puppet, 10Wikimedia-Apache-configuration, 13Patch-For-Review: Refactor the mediawiki puppet classes to make HHVM default, drop zend compatibility - https://phabricator.wikimedia.org/T126310#2177350 (10Joe) p:5Triage>3Normal a:3Joe [12:07:33] 6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Backport HHVM from sid => jessie and build all of our extensions for jessie as well - https://phabricator.wikimedia.org/T131755#2177353 (10Joe) [12:08:06] 6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2177150 (10Joe) p:5Triage>3Normal a:3Joe [12:10:47] PROBLEM - kartotherian on maps-test2004 is CRITICAL: Connection refused [12:11:56] 6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Convert the hhvm puppet module to be compatible with Debian jessie - https://phabricator.wikimedia.org/T131756#2177370 (10Joe) [12:12:07] 6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Convert the hhvm puppet module to be compatible with Debian jessie - https://phabricator.wikimedia.org/T131756#2177370 (10Joe) a:5Joe>3None [12:14:07] (03PS1) 10ArielGlenn: set up for keyholder for dumps deployment from deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/281425 [12:15:07] PROBLEM - tilerator on maps-test2004 is CRITICAL: Connection refused [12:15:27] PROBLEM - tileratorui on maps-test2004 is CRITICAL: Connection refused [12:16:07] RECOVERY - kartotherian on maps-test2004 is OK: HTTP OK: HTTP/1.1 200 OK - 949 bytes in 0.083 second response time [12:16:22] <_joe_> what happened to maps-test2004? 
[12:16:57] RECOVERY - tilerator on maps-test2004 is OK: HTTP OK: HTTP/1.1 200 OK - 304 bytes in 0.088 second response time [12:17:17] RECOVERY - tileratorui on maps-test2004 is OK: HTTP OK: HTTP/1.1 200 OK - 304 bytes in 0.089 second response time [12:18:46] PROBLEM - RAID on stat1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) [12:30:44] outch --^ [12:32:41] (03PS1) 10ArielGlenn: setup for dumps deployment public key on snapshot hosts for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/281428 [12:34:27] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 72.41% of data above the critical threshold [5000000.0] [12:43:31] 6Operations, 10ops-eqiad, 10Phabricator: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2176955 (10Cmjohnson) There has been several severs with cpu heat issues over the last few months. Re-applying thermal paste has been an effective fix. Iridium i... [12:43:36] (03CR) 10Dereckson: Enable Translate extension on uawikimedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm) [12:43:53] (03CR) 10BBlack: [C: 031] resolving::domain_search: drop esams.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/280503 (https://phabricator.wikimedia.org/T123712) (owner: 10Dzahn) [12:45:08] (03CR) 10Dereckson: [C: 031] Two permission changes at cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281314 (https://phabricator.wikimedia.org/T131684) (owner: 10Luke081515) [12:47:55] (03PS4) 10BBlack: tlsproxy: nginx security restrictions via systemd unit frag [puppet] - 10https://gerrit.wikimedia.org/r/279952 [12:49:18] (03CR) 10BBlack: [C: 032] tlsproxy: nginx security restrictions via systemd unit frag [puppet] - 10https://gerrit.wikimedia.org/r/279952 (owner: 10BBlack) [12:50:50] (03CR) 10Dereckson: "@awight What's the exact goal of this change?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis) [12:52:45] (03CR) 10Dereckson: [C: 04-1] "Misleading commit message, lack of clear rationale." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis) [12:53:06] 6Operations, 10Traffic, 10fundraising-tech-ops, 13Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#2177434 (10BBlack) Note this should get resolved via T130414 's https://gerrit.wikimedia.org/r/#/c/278353 [12:55:58] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Puppet has 1 failures [12:55:58] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: Puppet has 1 failures [12:56:47] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: Puppet has 1 failures [12:56:47] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Puppet has 1 failures [12:57:47] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Puppet has 1 failures [12:57:55] annoying :P [13:00:08] 6Operations, 10Traffic, 13Patch-For-Review, 7Varnish: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#2177437 (10ema) 5Open>3Resolved [13:03:33] (03PS2) 10ArielGlenn: set up for keyholder for dumps deployment from deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/281425 [13:03:54] (03PS1) 10BBlack: tlsproxy: add parent directory for sysd unit frags [puppet] - 10https://gerrit.wikimedia.org/r/281432 [13:04:57] (03CR) 10jenkins-bot: [V: 04-1] tlsproxy: add parent directory for sysd unit frags [puppet] - 10https://gerrit.wikimedia.org/r/281432 (owner: 10BBlack) [13:05:34] (03CR) 10ArielGlenn: [C: 032] set up for keyholder for dumps deployment from deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/281425 (owner: 10ArielGlenn) [13:05:52] (03PS2) 10BBlack: tlsproxy: add parent 
directory for sysd unit frags [puppet] - 10https://gerrit.wikimedia.org/r/281432 [13:06:20] (03PS2) 10ArielGlenn: setup for dumps deployment public key on snapshot hosts for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/281428 [13:08:03] (03CR) 10ArielGlenn: [C: 032] setup for dumps deployment public key on snapshot hosts for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/281428 (owner: 10ArielGlenn) [13:14:05] (03PS3) 10BBlack: tlsproxy: add parent directory for sysd unit frags [puppet] - 10https://gerrit.wikimedia.org/r/281432 [13:14:12] (03CR) 10BBlack: [C: 032 V: 032] tlsproxy: add parent directory for sysd unit frags [puppet] - 10https://gerrit.wikimedia.org/r/281432 (owner: 10BBlack) [13:16:27] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [13:18:07] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: Puppet has 1 failures [13:18:57] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 1 failures [13:19:17] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Puppet has 1 failures [13:19:17] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [13:19:57] RECOVERY - puppet last run on cp1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:20:38] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:21:06] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:21:06] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [13:21:57] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 1 failures [13:22:06] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Puppet has 1 failures [13:22:48] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: 
Puppet has 1 failures [13:22:49] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [13:22:50] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Puppet has 1 failures [13:23:46] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:23:47] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:23:57] puppet should be fine on caches, this is just the delayed fallout of the "bad patch" -> "disable puppet" -> "merge fix" -> "enable + run puppet" cycle, which ends up notifying of failures along the way [13:24:27] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures [13:24:27] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:24:36] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:24:36] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:24:36] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [13:26:17] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:26:26] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:27:47] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [13:28:55] (03CR) 10Luke081515: "reckeck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm) [13:29:07] (03CR) 10Luke081515: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm) [13:30:22] 6Operations, 10ops-eqiad: stat1002 broken 
disk causing degraded RAID array - https://phabricator.wikimedia.org/T131758#2177452 (10elukey) [13:32:01] (03PS1) 10ArielGlenn: enable keyholder and scap cfg setup on deployment servers for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281435 [13:39:04] 6Operations, 10Traffic, 7Varnish: Add icinga monitoring for varnish statistics daemons - https://phabricator.wikimedia.org/T131760#2177485 (10ema) [13:40:24] (03CR) 10ArielGlenn: [C: 032] enable keyholder and scap cfg setup on deployment servers for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281435 (owner: 10ArielGlenn) [13:40:36] !log nginx rolling restart for openssl upgrade on cache hosts [13:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:46:13] 6Operations, 10Traffic, 7Varnish: Add icinga monitoring for varnish statistics daemons - https://phabricator.wikimedia.org/T131760#2177507 (10ema) p:5Triage>3Normal [13:47:18] 6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Backport HHVM from sid => jessie and build all of our extensions for jessie as well - https://phabricator.wikimedia.org/T131755#2177508 (10Joe) p:5Triage>3Normal a:3Joe [13:47:38] (03PS1) 10ArielGlenn: fix typo in name of private key file for dumps dpeloyment [puppet] - 10https://gerrit.wikimedia.org/r/281438 [13:48:12] ciao ema [13:48:20] hey Nemo_bis [13:48:30] I hadn't memorized your nick yet [translated from Italian] [13:49:01] (03CR) 10ArielGlenn: [C: 032] fix typo in name of private key file for dumps dpeloyment [puppet] - 10https://gerrit.wikimedia.org/r/281438 (owner: 10ArielGlenn) [13:49:06] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures [13:50:35] (03PS1) 10BBlack: varnish+statsd: refactor classes, move rls to text-only [puppet] - 10https://gerrit.wikimedia.org/r/281439 (https://phabricator.wikimedia.org/T131353) [13:51:28] 6Operations, 7HHVM: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2177518 (10Joe) [13:51:31]
6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2177517 (10Joe) [13:51:51] PROBLEM - too many italian words registered in the channel :D [13:52:03] indeed [13:52:26] (03CR) 10BBlack: "@ori: it seems like rls should be text-cluster-only (where load.php lives). I was noting it seems to log some basic headers for all reqs," [puppet] - 10https://gerrit.wikimedia.org/r/281439 (https://phabricator.wikimedia.org/T131353) (owner: 10BBlack) [13:52:37] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [13:53:48] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures [13:54:27] 6Operations, 6Labs, 13Patch-For-Review: Labtest designate giving out Forbidden exceptions when trying to list domains - https://phabricator.wikimedia.org/T130979#2177525 (10chasemp) p:5Triage>3Low [13:57:28] (03CR) 10Base: Enable Translate extension on uawikimedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm) [13:58:40] (03PS1) 10ArielGlenn: start new snapshot role with the basics at first: scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/281440 [14:00:12] (03CR) 10Ema: [C: 031] "Checked with pcc, works as advertised." 
[puppet] - 10https://gerrit.wikimedia.org/r/281439 (https://phabricator.wikimedia.org/T131353) (owner: 10BBlack) [14:00:54] (03CR) 10ArielGlenn: [C: 032] start new snapshot role with the basics at first: scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/281440 (owner: 10ArielGlenn) [14:01:49] 6Operations, 6Labs, 10Labs-Infrastructure: labnet1002 can't talk to webproxy.eqiad.wmnet:8080, puppet fails to install designateclient - https://phabricator.wikimedia.org/T129623#2177536 (10chasemp) p:5Triage>3Normal [14:06:48] 6Operations, 6Labs: Can't create account "Bishoy Camel" (user with a former SVN account not migrated) - https://phabricator.wikimedia.org/T128833#2177556 (10chasemp) p:5Triage>3High [14:07:30] 6Operations, 6Labs, 10Monitoring, 10Tool-Labs: Make icinga-wm report Tools homepage check at #wikimedia-labs, too - https://phabricator.wikimedia.org/T128716#2177558 (10chasemp) p:5Triage>3Low [14:07:37] 6Operations, 6Labs, 10Monitoring, 10Tool-Labs: Add other Tools administrators to the Icinga notification group - https://phabricator.wikimedia.org/T128715#2177559 (10chasemp) p:5Triage>3Normal [14:07:59] 6Operations, 6Labs, 10Tool-Labs: Get rid of Tool Labs home page check from shinken - https://phabricator.wikimedia.org/T128615#2177561 (10chasemp) p:5Triage>3Normal [14:08:40] (03PS1) 10ArielGlenn: enable new snapshot role on snapshot1005 [puppet] - 10https://gerrit.wikimedia.org/r/281442 [14:10:35] (03PS5) 10Urbanecm: Enable Translate extension on uawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) [14:13:05] (03PS6) 10Urbanecm: Enable Translate extension on uawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) [14:14:07] (03CR) 10ArielGlenn: [C: 032] enable new snapshot role on snapshot1005 [puppet] - 10https://gerrit.wikimedia.org/r/281442 (owner: 10ArielGlenn) [14:17:31] if icinga whines aobut puppet errors on snapshot1005, 
please ignore, I'm working on it [14:20:36] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [14:20:37] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet has 1 failures [14:23:58] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [14:26:43] (03PS2) 10Ldaptestaccount123: Tools: Add mytop [puppet] - 10https://gerrit.wikimedia.org/r/272435 (https://phabricator.wikimedia.org/T58999) (owner: 10BryanDavis) [14:27:52] (03CR) 10Dereckson: "Logic looks good to me." (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm) [14:27:53] sorry what? [14:27:59] ldaptestaccount what now? [14:28:36] 6Operations, 10ops-codfw: rack five new spare pool systems - https://phabricator.wikimedia.org/T130941#2177593 (10Papaul) [14:30:59] (03PS1) 10Urbanecm: Add new namespaces and new aliases for newikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) [14:31:19] (03CR) 10jenkins-bot: [V: 04-1] Add new namespaces and new aliases for newikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [14:35:13] 6Operations, 10Traffic, 7Varnish: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177610 (10BBlack) [14:35:44] (03PS2) 10Urbanecm: Add new namespaces and new aliases for newikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) [14:35:45] 6Operations, 10ops-codfw: rack five new spare pool systems - https://phabricator.wikimedia.org/T130941#2177627 (10Papaul) Network ports information WMF6404 ge-5/0/23 row A rack A5 WMF6405 ge-5/0/14 row C rack C5 WMF6406 ge-5/0/15 row C rack C5 WMF6407 ge-5/0/09 row D rack D5 
WMF6408... [14:36:22] 6Operations, 10ops-codfw: rack five new spare pool systems - https://phabricator.wikimedia.org/T130941#2177632 (10Papaul) [14:36:43] (03PS1) 10ArielGlenn: scap3 for dumps needs the scap dir group writeable by wikidev it seems [puppet] - 10https://gerrit.wikimedia.org/r/281445 [14:38:04] (03CR) 10ArielGlenn: [C: 032] scap3 for dumps needs the scap dir group writeable by wikidev it seems [puppet] - 10https://gerrit.wikimedia.org/r/281445 (owner: 10ArielGlenn) [14:39:37] (03PS1) 10Filippo Giunchedi: installserver: port squid3 changes for trusty/jessie [puppet] - 10https://gerrit.wikimedia.org/r/281447 (https://phabricator.wikimedia.org/T123733) [14:46:11] !log de-pooled aqs1001.eqiad from the confd pool for nodejs upgrade [14:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:54] (03PS4) 10Rush: Tools: Add dev packages needed to compile python-ldap [puppet] - 10https://gerrit.wikimedia.org/r/272415 (https://phabricator.wikimedia.org/T114388) (owner: 10BryanDavis) [14:48:11] (03PS7) 10Urbanecm: Enable Translate extension on uawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) [14:48:57] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. [14:49:53] (03CR) 10Urbanecm: "@Dereckson: Fixed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm) [14:50:54] (03CR) 10Rush: [C: 032] Tools: Add dev packages needed to compile python-ldap [puppet] - 10https://gerrit.wikimedia.org/r/272415 (https://phabricator.wikimedia.org/T114388) (owner: 10BryanDavis) [14:51:01] (03PS3) 10Rush: Tools: Add mytop [puppet] - 10https://gerrit.wikimedia.org/r/272435 (https://phabricator.wikimedia.org/T58999) (owner: 10BryanDavis) [14:51:11] (03CR) 10Rush: [C: 032 V: 032] Tools: Add mytop [puppet] - 10https://gerrit.wikimedia.org/r/272435 (https://phabricator.wikimedia.org/T58999) (owner: 10BryanDavis) [14:52:09] (03CR) 10Thcipriani: "One inline comment/question" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/279415 (owner: 10Mobrovac) [14:53:36] PROBLEM - AQS root url on aqs1001 is CRITICAL: Connection refused [14:54:08] PROBLEM - cassandra CQL 10.64.0.123:9042 on aqs1001 is CRITICAL: Connection refused [14:54:16] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.123, port=7232): Max retries exceeded with url: /analytics.wikimedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [14:54:56] PROBLEM - Analytics Cassanda CQL query interface on aqs1001 is CRITICAL: Connection refused [14:55:08] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [14:55:15] !log Restarting restbase2004-a.codfw.wmnet (cancelling bootstrap of 2004-b) [14:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:55:38] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: /pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): 
/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /unique-devices/{p [14:55:39] PROBLEM - Analytics Cassandra database on aqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [14:55:39] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: /pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /unique-devices/{p [14:56:07] PROBLEM - cassandra service on aqs1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [14:56:37] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [14:57:08] RECOVERY - AQS root url on aqs1001 is OK: HTTP OK: HTTP/1.1 200 - 727 bytes in 0.033 second response time [14:57:27] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [14:57:28] RECOVERY - Analytics Cassandra database on aqs1001 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [14:57:28] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [14:57:48] RECOVERY - cassandra CQL 10.64.0.123:9042 on aqs1001 is OK: TCP OK - 0.008 second response time on port 9042 [14:57:56] RECOVERY - cassandra service on aqs1001 is OK: OK - cassandra is active [14:57:57] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [14:58:37] RECOVERY - Analytics Cassanda CQL query interface on aqs1001 is OK: TCP OK - 0.007 second response time on port 9042 [15:00:05] anomie ostriches thcipriani 
marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160404T1500). [15:00:05] bearND mdholloway: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:43] 6Operations, 10Traffic, 7Varnish: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177744 (10BBlack) On point 3 above (when does varnish send TE:chunked?), my best observations/code-searching indicate: 1. Obviously a do_stream of a chunked fetch is a chun... [15:01:39] hi [15:01:48] bearND: Hi, I can SWAT for you :) [15:02:04] thcipriani: thanks [15:03:21] (03PS1) 10Papaul: DNS: Adding mgmt DNS for spare pool servers Bug: T130941 [dns] - 10https://gerrit.wikimedia.org/r/281449 (https://phabricator.wikimedia.org/T130941) [15:04:23] 7Blocked-on-Operations, 6Operations, 10RESTBase, 10RESTBase-Cassandra, 13Patch-For-Review: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2177752 (10fgiunchedi) on sunday restbase2004 ran out of disk space while bootstrapping 2004-b ``` 12:... [15:06:03] 6Operations, 10ops-codfw: rack five new spare pool systems - https://phabricator.wikimedia.org/T130941#2177758 (10Papaul) [15:08:10] 6Operations, 10ops-codfw: rack five new spare pool systems - https://phabricator.wikimedia.org/T130941#2177761 (10Papaul) a:5Papaul>3RobH @ Please update switch ports and resolve this task once complete. Thanks [15:08:27] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). 
[15:08:48] 6Operations, 10Traffic, 7Varnish: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177767 (10BBlack) And on point 2 above: since varnish seems to be smart about using TE:chunked only when the response length isn't easy to know, there's not much wiggle room... [15:09:02] 6Operations, 10Traffic, 7Varnish: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501#2168861 (10ema) a:3ema [15:11:17] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:13:27] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:15:21] 6Operations, 10ops-eqiad, 10netops: investigate why mr1-eqiad randomly rebooted - https://phabricator.wikimedia.org/T131379#2177774 (10faidon) 5Open>3declined I re-rebooted it from the console, as it wasn't able to read th SSH keys (!? the CF is maybe broken?) and hence sshd was unable to start. It works... [15:18:40] !log thcipriani@tin Synchronized php-1.27.0-wmf.19/extensions/MobileApp/config/config.json: SWAT: Roll out RESTBase usage to Android production app: 50% [[gerrit:280957]] (duration: 00m 46s) [15:18:41] 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2177789 (10chasemp) p:5Triage>3High [15:18:43] ^ bearND check please [15:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:42] thcipriani: looks good, thank you [15:19:53] bearND: cool, thanks for checking! [15:21:49] 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2177807 (10chasemp) Some details from an email @tstarling sent out as a notice ```...So I powered it up, and it came up, but /var/lo... 
[15:23:24] 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team, 15User-greg: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2177828 (10chasemp) a:3greg tossing your way because of the nature of the issue and need for immediate feedback for... [15:23:38] (03PS12) 10KartikMistry: Enable non-default MT for some languages [puppet] - 10https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849) [15:24:11] 6Operations, 10Traffic, 7Varnish: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177838 (10BBlack) Going into more detail on the current behaviors of misc and upload clusters: **cache_misc** - regardless of tier/layer, it sets do_stream for objects >= 1... [15:24:53] 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team, 15User-greg: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2177840 (10greg) yeah, stab in the dark guess is it might happen again tonight after the dumps (I assume?) run. When... [15:24:55] (03PS1) 10ArielGlenn: add the dumps deploy pub key [puppet] - 10https://gerrit.wikimedia.org/r/281455 [15:25:28] (03PS2) 10ArielGlenn: add the dumps deploy pub key [puppet] - 10https://gerrit.wikimedia.org/r/281455 [15:25:49] 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2177842 (10greg) a:5greg>3None [15:38:23] 6Operations, 10Traffic, 7Varnish: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177962 (10BBlack) Also, confirmed that do_stream of a non-chunked fetch doesn't cause chunked response on cache_upload. 
[15:39:11] (03Abandoned) 10ArielGlenn: add the dumps deploy pub key [puppet] - 10https://gerrit.wikimedia.org/r/281455 (owner: 10ArielGlenn) [15:39:55] (03PS1) 10Ema: Misc cluster VCL: avoid name conflict between directors and probes [puppet] - 10https://gerrit.wikimedia.org/r/281457 (https://phabricator.wikimedia.org/T131501) [15:40:54] 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2177971 (10Cmjohnson) @greg Downtime Max 10 minutes but not even that long. Can do whenever you're ready [15:41:53] 6Operations, 10Traffic, 7Varnish: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177973 (10BBlack) Another input here: in the common case, it seems MediaWiki outputs content with TE:chunked, too. [15:42:48] !log ran wikitech-static updates [15:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:45:38] !log aqs1001 re-added to the aqs pool (nodejs NOT upgraded due to issues with Cassandra) [15:45:52] elukey: remove the leading space :) [15:46:26] greg-g: ah snap sorry! [15:49:05] (03PS1) 10ArielGlenn: add dumps deployment key [puppet] - 10https://gerrit.wikimedia.org/r/281460 [15:49:14] ACKNOWLEDGEMENT - RAID on stat1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) Elukey Opened a phab task for DC Ops: https://phabricator.wikimedia.org/T131758 [15:49:34] --^ sorry I forgot to ack icinga [15:51:01] (03CR) 10Thcipriani: [C: 031] "Looks awesome, cleans up a lot of cruft." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/279198 (https://phabricator.wikimedia.org/T130419) (owner: 1020after4) [15:51:15] (03CR) 10ArielGlenn: [C: 032] add dumps deployment key [puppet] - 10https://gerrit.wikimedia.org/r/281460 (owner: 10ArielGlenn) [15:53:11] !log 15:45 < elukey> !log aqs1001 re-added to the aqs pool (nodejs NOT upgraded due to issues with Cassandra) [15:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:27] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [15:54:46] 6Operations, 10DBA, 6Labs, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#2178009 (10Dispenser) [15:59:03] greg-g: sorry I was doing 3 things at the time and didn't get why you told me to re-log, I probably need coffee :) [15:59:16] 6Operations: Integrate jessie 8.4 point release - https://phabricator.wikimedia.org/T131746#2178020 (10MoritzMuehlenhoff) The following updates from jessie 8.4 have been deployed: pcre3 gnupg apt nettle giflib subversion unbound stress [15:59:16] (or do less things, or both) [15:59:32] thanks anyway :) [16:01:01] :) [16:04:47] 6Operations, 6Labs, 13Patch-For-Review: Labtest designate giving out Forbidden exceptions when trying to list domains - https://phabricator.wikimedia.org/T130979#2178026 (10Andrew) 5Open>3Resolved a:3Andrew [16:06:19] (03CR) 10ArielGlenn: "note that soon I'll have to add dumps_blahblah to this patchset (or to the merged code if it's merged before I get there)." [puppet] - 10https://gerrit.wikimedia.org/r/279198 (https://phabricator.wikimedia.org/T130419) (owner: 1020after4) [16:16:24] 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2178070 (10mmodell) Thanks @dzahn for setting this up so quickly. 
I tested that and I wa... [16:16:55] 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2178072 (10mmodell) [16:27:43] 6Operations, 10Traffic, 7Varnish: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2178082 (10ema) [16:29:44] argh [16:29:56] wait a moment [16:30:47] that was it. Phabricator has been up again for more than two hours, so I guess someone forgot to change the topic? [16:31:56] Luke081515: I'm sure yes thanks [16:32:22] np ;) [16:32:34] jouncebot next [16:32:35] In 3 hour(s) and 27 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160404T2000) [16:33:21] mafk: Need to deploy an easy change? If so, I can do it for you at the evening SWAT [16:33:31] I already have two patches scheduled [16:33:42] Luke081515: well, I have an astwiki logo one [16:33:48] not sure if doing it today [16:33:51] which gerrit number? [16:33:56] lemme search [16:34:34] 280445 [16:35:18] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 72.41% of data above the critical threshold [5000000.0] [16:35:38] mafk: Sounds good.
If you want, I can do it at the evening SWAT [16:35:53] Luke081515: if you want, that'd be good [16:36:04] ok, I will sign it up [16:36:08] !log Restarting bootstrap of restbase2004.codfw.wmnet : T95253 [16:36:09] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253 [16:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:36:14] I already have two, so it's not a big deal ;) [16:36:20] :D thank you [16:37:39] (03PS1) 10Rush: toollabs elastic don't use nginx light [puppet] - 10https://gerrit.wikimedia.org/r/281464 (https://phabricator.wikimedia.org/T131644) [16:40:01] (03PS1) 10Faidon Liambotis: Move network::checks to netops::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/281465 [16:40:03] (03PS1) 10Faidon Liambotis: netops: abstract monitoring checks into a define [puppet] - 10https://gerrit.wikimedia.org/r/281466 [16:40:05] (03PS1) 10Faidon Liambotis: Add check_jnx_alarms to check Juniper chassis alarms [puppet] - 10https://gerrit.wikimedia.org/r/281467 (https://phabricator.wikimedia.org/T83992) [16:41:40] (03CR) 10jenkins-bot: [V: 04-1] netops: abstract monitoring checks into a define [puppet] - 10https://gerrit.wikimedia.org/r/281466 (owner: 10Faidon Liambotis) [16:41:53] (03CR) 10jenkins-bot: [V: 04-1] Add check_jnx_alarms to check Juniper chassis alarms [puppet] - 10https://gerrit.wikimedia.org/r/281467 (https://phabricator.wikimedia.org/T83992) (owner: 10Faidon Liambotis) [16:43:00] (03PS2) 10Faidon Liambotis: Add check_jnx_alarms to check Juniper chassis alarms [puppet] - 10https://gerrit.wikimedia.org/r/281467 (https://phabricator.wikimedia.org/T83992) [16:43:02] (03PS2) 10Faidon Liambotis: netops: abstract monitoring checks into a define [puppet] - 10https://gerrit.wikimedia.org/r/281466 [16:45:06] Luke081515: I see that there's an imminent (?) migration from Gerrit to Differential for mediawiki-config.
Is Arcanist the program we should use to commit to Differential/Diffusion? [16:45:54] T131418 [16:45:54] T131418: Migrate mediawiki-config to Differential - https://phabricator.wikimedia.org/T131418 [16:46:21] (03PS2) 10Rush: toollabs elastic don't use nginx light [puppet] - 10https://gerrit.wikimedia.org/r/281464 (https://phabricator.wikimedia.org/T131644) [16:47:38] 6Operations, 10Ops-Access-Requests: Grant reedy access to librenms - https://phabricator.wikimedia.org/T131252#2178157 (10RobH) a:3RobH Discussed in the ops meeting, and it's granted. However, we need to chat with Sam about 3 phase power and how it operates so no misleading figures are disclosed. So I'll se... [16:54:28] mafk: Yes [16:56:15] paladox: thank you [16:56:23] you're welcome [16:58:33] (03CR) 10Rush: [C: 032] toollabs elastic don't use nginx light [puppet] - 10https://gerrit.wikimedia.org/r/281464 (https://phabricator.wikimedia.org/T131644) (owner: 10Rush) [16:59:23] hey _joe_. Wanted to talk to you briefly about https://phabricator.wikimedia.org/T118495 [16:59:38] ^ uwsgi restarts taking exactly 1 minute and 30 seconds. [16:59:48] We learned some things and have a proposed short-term fix [17:02:49] I'm jet lagging and going to turn into a pile of mush, so I'm going to head offline for now. [17:02:57] Should be more normalish tomorrow. [17:02:58] o/ [17:06:15] 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2178278 (10greg) >>! In T131742#2177971, @Cmjohnson wrote: > @greg Downtime Max 10 minutes but not even that long. Can do whenever yo...
[17:06:27] 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2178279 (10greg) a:3Cmjohnson [17:12:57] 6Operations, 10Phabricator, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2178301 (10mmodell) [17:14:42] 6Operations, 10Phabricator, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2178316 (10jeremyb-phone) [17:15:07] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [17:15:53] 6Operations, 10Phabricator, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2178325 (10greg) See also: {T131742} :) [17:17:25] (03PS2) 10Faidon Liambotis: Move network::checks to netops::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/281465 [17:17:27] (03PS3) 10Faidon Liambotis: netops: abstract monitoring checks into a define [puppet] - 10https://gerrit.wikimedia.org/r/281466 [17:17:29] (03PS1) 10Faidon Liambotis: netops: add IPv6 host checks [puppet] - 10https://gerrit.wikimedia.org/r/281473 [17:17:31] (03PS1) 10Faidon Liambotis: netops: also monitor pfw-eqiad/pfw-codfw [puppet] - 10https://gerrit.wikimedia.org/r/281474 [17:18:01] apergos: snapshot1005 puppet failure [17:18:12] yep working on it [17:18:14] Jeff_Green: alnilam has a disk alert [17:18:26] paravoid: looking [17:18:34] Jeff_Green: silicon too [17:18:46] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[17:18:56] Jeff_Green: and silicon also has a check_amq_store alert for a while now [17:19:15] andrewbogott/chasemp: labvirt1002 disk space alert too [17:19:31] paravoid: thanks, will look post-meeting [17:19:38] (03CR) 10jenkins-bot: [V: 04-1] netops: also monitor pfw-eqiad/pfw-codfw [puppet] - 10https://gerrit.wikimedia.org/r/281474 (owner: 10Faidon Liambotis) [17:20:14] (03PS2) 10Faidon Liambotis: netops: also monitor pfw-eqiad/pfw-codfw [puppet] - 10https://gerrit.wikimedia.org/r/281474 [17:21:10] (03CR) 10Faidon Liambotis: [C: 032] Move network::checks to netops::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/281465 (owner: 10Faidon Liambotis) [17:21:58] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [17:23:50] paravoid: ok, it's all the same thing ultimately--they're both basically early warnings re. activemq store to bloat [17:24:07] I'll adjust the thresholds and harangue fr-tech [17:25:29] what's with stashbot? [17:26:12] (03CR) 10Faidon Liambotis: [C: 032] netops: abstract monitoring checks into a define [puppet] - 10https://gerrit.wikimedia.org/r/281466 (owner: 10Faidon Liambotis) [17:26:33] Will anyone have an issue if Phabricator is down for up to 10 mins? 
[17:27:18] fine for me cmjohnson1 [17:29:04] (03PS1) 10ArielGlenn: fix up name of dumps repo for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/281476 [17:29:19] cmjohnson1: doit, not real "great" time today, so jfdi :) [17:29:24] s/not/no/ [17:29:46] k....i posted in -devtool channel [17:29:58] will take it down in 10 mins [17:29:59] thanks [17:30:26] (03CR) 10ArielGlenn: [C: 032] fix up name of dumps repo for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/281476 (owner: 10ArielGlenn) [17:30:33] argh [17:30:42] (03PS2) 10Faidon Liambotis: netops: add IPv6 host checks [puppet] - 10https://gerrit.wikimedia.org/r/281473 [17:30:47] !log Phabricator going down in about 10 minutes to hopefully address the overheating issue: T131742 [17:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:30:52] (03CR) 10Faidon Liambotis: [C: 032] netops: add IPv6 host checks [puppet] - 10https://gerrit.wikimedia.org/r/281473 (owner: 10Faidon Liambotis) [17:31:52] (03PS3) 10Faidon Liambotis: netops: also monitor pfw-eqiad/pfw-codfw [puppet] - 10https://gerrit.wikimedia.org/r/281474 [17:32:00] (03CR) 10Faidon Liambotis: [C: 032 V: 032] netops: also monitor pfw-eqiad/pfw-codfw [puppet] - 10https://gerrit.wikimedia.org/r/281474 (owner: 10Faidon Liambotis) [17:32:21] pre-emptive /topic change :) [17:34:18] (03PS3) 10Faidon Liambotis: Add check_jnx_alarms to check Juniper chassis alarms [puppet] - 10https://gerrit.wikimedia.org/r/281467 (https://phabricator.wikimedia.org/T83992) [17:37:46] !log shutting down iridium to reapply thermal paste [17:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:39:48] PROBLEM - Router interfaces on mr1-codfw.oob is CRITICAL: CRITICAL: No response from remote host 216.117.46.36 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [17:40:19] PROBLEM - Router interfaces on mr1-eqiad.oob is CRITICAL: CRITICAL: No response from remote host 198.32.107.153 for 1.3.6.1.2.1.2.2.1.8 with 
snmp version 2 [17:40:48] PROBLEM - Router interfaces on mr1-esams.oob is CRITICAL: CRITICAL: No response from remote host 164.138.24.90 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [17:40:56] ignore those, that would be me [17:41:29] PROBLEM - Router interfaces on mr1-ulsfo.oob is CRITICAL: CRITICAL: No response from remote host 209.237.234.242 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [17:41:48] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down! [17:42:28] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down! [17:42:29] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down! [17:42:37] <_joe_> oh, heh [17:42:42] <_joe_> that would be chris [17:43:10] yep [17:43:11] ..."Phabricator will be down for about 10 minutes (T131742)" referencing a Phab-ticket....for when Phab is down...smart... [17:43:12] :p [17:43:20] Josve05a: hey! what do you want?! :P [17:43:32] see the content of that ticket xD [17:44:03] Josve05a: for when phab is back, see also: https://phabricator.wikimedia.org/T131775 (us asking for a backup web server for phab to make these things suck less) [17:44:19] pfft [17:44:20] paravoid: I can't find the labvirt1002 disk space warning, did it clear on its own or was that a typo? 
[17:44:33] how much does such a thing cos (actual cost) [17:44:34] It's just that iridium went down and someone is now (re-?)applying thermal paste on the CPU to prevent it from crashing again :) [17:44:35] cost* [17:44:48] andrewbogott: I guess it cleared on its own [17:44:51] it would be too long to put that in the topic [17:45:40] I want to read the phab ticket but it is down [17:45:43] :D [17:46:12] paravoid: do you remember what the numbers were? Was it a warning or a crit? [17:46:26] no and it was a warning [17:46:36] ok, good enough for me — thanks! [17:46:37] you should send an announcement somewhere that people can read once the main thing is down, like wikitech-l [17:47:38] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [17:48:58] +1 Amir1 . Link somewhere that isn't down (such as a wikitech-l archive post) to explain. [17:49:30] yeah :) [17:50:31] <_joe_> who broke icinga? [17:50:41] <_joe_> the config, I mean [17:51:07] I thought paravoid was making various config changes [17:51:09] pfw-eqiad [17:51:09] iridium is back up [17:51:17] Could not find any host matching 'pfw-eqiad' (config file '/etc/icinga/puppet_services.cfg', starting on line 322901) [17:51:34] just puppet convergence errors I think [17:51:37] I'm rerunning puppet [17:51:48] <_joe_> paravoid: heh, that can happen, yes [17:52:07] phab is up [17:52:18] <_joe_> ok, ttyl :) [17:52:59] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [17:53:07] 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2178470 (10Cmjohnson) a:5Cmjohnson>3None Clean off the old thermal paste and reapplied. Let's monitor for the next few days. Lea... 
[17:55:29] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [17:57:28] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [18:04:57] !log db1052 swapping failed disk slot 8 [18:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:06:28] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: puppet fail [18:09:08] (03CR) 10Mobrovac: Scap3: chown the target root dir if owned by root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/279415 (owner: 10Mobrovac) [18:11:45] (03PS1) 10Faidon Liambotis: netops::check: make IPv4 mandatory again [puppet] - 10https://gerrit.wikimedia.org/r/281480 [18:14:04] 6Operations, 10ops-eqiad: db1052 degraded RAID - https://phabricator.wikimedia.org/T131701#2178548 (10Volans) I've sync with @Cmjohnson and he swapped the disk, the RAID is now rebuilding: ``` $ sudo megacli -PDRbld -ShowProg -PhysDrv [32:8] -aALL Rebuild Progress on Device at Enclosure 32, Slot 8 Completed... 
[18:15:06] (03CR) 10Faidon Liambotis: [C: 032] netops::check: make IPv4 mandatory again [puppet] - 10https://gerrit.wikimedia.org/r/281480 (owner: 10Faidon Liambotis) [18:18:20] !log stat1002 swapping failed disk slot 11 [18:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:20:59] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:24:20] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [5000000.0] [18:25:01] (03PS1) 10Rush: Revert "toollabs elastic don't use nginx light" [puppet] - 10https://gerrit.wikimedia.org/r/281482 [18:27:03] (03PS2) 10Rush: Revert "toollabs elastic don't use nginx light" [puppet] - 10https://gerrit.wikimedia.org/r/281482 [18:30:40] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [18:33:15] (03CR) 10Rush: [C: 032] Revert "toollabs elastic don't use nginx light" [puppet] - 10https://gerrit.wikimedia.org/r/281482 (owner: 10Rush) [18:35:11] PROBLEM - Router interfaces on pfw-codfw is CRITICAL: CRITICAL: host 208.80.153.195, interfaces up: 90, down: 12, dormant: 0, excluded: 0, unused: 0BRge-11/0/14: down - BRge-11/0/15: down - BRge-2/0/8: down - BRreth0: down - BRge-9/0/3: down - BRge-11/0/6: down - BRge-0/0/3: down - BRge-2/0/9: down - BRge-2/0/14: down - BRge-11/0/7: down - BRge-0/0/2: down - BRge-9/0/2: down - BR [18:35:30] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 108, down: 2, dormant: 0, excluded: 0, unused: 0BRvlan.1131: down - Subnet frack-external1-c-eqiadBRreth0: down - BR [18:36:20] !log ran MassMessages/sendMessages.php for T128056 [18:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:36:29] (03CR) 10Bmansurov: "The new language overlay is in stable now: 
https://en.m.wikipedia.org/wiki/Book?mobileaction=stable#/languages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277837 (https://phabricator.wikimedia.org/T129274) (owner: 10Bmansurov) [18:44:27] 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2176955 (10hashar) @cmjohnson do we have a system to monitor temperature? lm_sensors comes to mind, also found out Diamond has a coll... [18:46:31] (03CR) 10BBlack: [C: 031] "+1 because this is a decent workaround for now." [puppet] - 10https://gerrit.wikimedia.org/r/281457 (https://phabricator.wikimedia.org/T131501) (owner: 10Ema) [18:47:38] 6Operations, 10Phabricator, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2178647 (10mmodell) >>! In T131775#2178325, @greg wrote: > See also: {T131742} :) Plus, every deployment involves significant downtime because phabricator services must al... 
[18:47:50] (03PS1) 10ArielGlenn: add rest of new snapshots to scap3 target list [puppet] - 10https://gerrit.wikimedia.org/r/281486 [18:49:21] (03CR) 10jenkins-bot: [V: 04-1] add rest of new snapshots to scap3 target list [puppet] - 10https://gerrit.wikimedia.org/r/281486 (owner: 10ArielGlenn) [18:49:56] (03PS2) 10ArielGlenn: add rest of new snapshots to scap3 target list [puppet] - 10https://gerrit.wikimedia.org/r/281486 [18:50:01] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [18:50:54] (03CR) 10jenkins-bot: [V: 04-1] add rest of new snapshots to scap3 target list [puppet] - 10https://gerrit.wikimedia.org/r/281486 (owner: 10ArielGlenn) [18:51:01] (03PS3) 10ArielGlenn: add rest of new snapshots to scap3 target list [puppet] - 10https://gerrit.wikimedia.org/r/281486 [18:52:27] (03CR) 10jenkins-bot: [V: 04-1] add rest of new snapshots to scap3 target list [puppet] - 10https://gerrit.wikimedia.org/r/281486 (owner: 10ArielGlenn) [18:53:32] (03PS1) 10Faidon Liambotis: netops: disable SNMP checks for OOB interfaces [puppet] - 10https://gerrit.wikimedia.org/r/281490 [18:54:04] (03PS2) 10Faidon Liambotis: netops: disable SNMP checks for OOB interfaces [puppet] - 10https://gerrit.wikimedia.org/r/281490 [18:55:48] (03CR) 10Faidon Liambotis: [C: 032] netops: disable SNMP checks for OOB interfaces [puppet] - 10https://gerrit.wikimedia.org/r/281490 (owner: 10Faidon Liambotis) [18:57:34] (03PS4) 10Faidon Liambotis: Add check_jnx_alarms to check Juniper chassis alarms [puppet] - 10https://gerrit.wikimedia.org/r/281467 (https://phabricator.wikimedia.org/T83992) [19:04:01] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 108, down: 0, dormant: 0, excluded: 1, unused: 0 [19:05:21] (03PS4) 10ArielGlenn: add rest of new snapshots to scap3 target list [puppet] - 10https://gerrit.wikimedia.org/r/281486 [19:06:38] (03CR) 10Faidon Liambotis: [C: 032] Add check_jnx_alarms to 
check Juniper chassis alarms [puppet] - 10https://gerrit.wikimedia.org/r/281467 (https://phabricator.wikimedia.org/T83992) (owner: 10Faidon Liambotis) [19:10:03] PROBLEM - puppet last run on restbase-test2003 is CRITICAL: CRITICAL: puppet fail [19:11:53] 6Operations, 10Ops-Access-Requests: global root access for gilles - https://phabricator.wikimedia.org/T130910#2178741 (10ori) 5declined>3Open First of all, the request was not for global root. The task description makes it clear that the access request is for Swift machines. My follow-up comment was a prop... [19:13:12] (03PS5) 10ArielGlenn: add rest of new snapshots to scap3 target list [puppet] - 10https://gerrit.wikimedia.org/r/281486 [19:15:01] (03CR) 10ArielGlenn: [C: 032] add rest of new snapshots to scap3 target list [puppet] - 10https://gerrit.wikimedia.org/r/281486 (owner: 10ArielGlenn) [19:16:15] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [19:18:16] RECOVERY - Router interfaces on pfw-codfw is OK: OK: host 208.80.153.195, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 [19:19:10] (03PS1) 10ArielGlenn: set up scap3 deploy for snapshot1006 snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/281494 [19:19:26] PROBLEM - Juniper alarms on cr2-esams is CRITICAL: JNX_ALARMS CRITICAL - 4 red alarms, 0 yellow alarms [19:19:28] chasemp: can I get the config lock on cr1-eqiad? [19:20:02] yep go for it I'm puzzling over a failure to reorder here no worries paravoid [19:20:30] chasemp: can you rollback; quit? [19:21:34] (03CR) 10ArielGlenn: [C: 032] set up scap3 deploy for snapshot1006 snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/281494 (owner: 10ArielGlenn) [19:22:17] ACKNOWLEDGEMENT - Juniper alarms on cr2-esams is CRITICAL: JNX_ALARMS CRITICAL - 4 red alarms, 0 yellow alarms Faidon Liambotis Still under provisioning [19:22:23] paravoid: gtg I think? 
[19:22:31] andrewbogott: labvirt1002 again -- DISK WARNING - free space: /var/lib/nova/instances 158834 MB (6% inode=99%): [19:22:42] paravoid: ok, thanks [19:25:53] 6Operations, 10Ops-Access-Requests: global root access for gilles - https://phabricator.wikimedia.org/T130910#2150515 (10BBlack) >>! In T130910#2178741, @ori wrote: > First of all, the request was not for global root. The task description makes it clear that the access request is for Swift machines. My follow-... [19:26:46] (03CR) 10Thcipriani: Scap3: chown the target root dir if owned by root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/279415 (owner: 10Mobrovac) [19:28:17] 6Operations, 10Monitoring, 10netops, 13Patch-For-Review: Juniper monitoring - https://phabricator.wikimedia.org/T83992#2178759 (10faidon) [19:28:36] RECOVERY - puppet last run on restbase-test2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:31:36] 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2178776 (10chasemp) Afaik the 'talk to phabricator' portion here is relevant for git-ssh... 
[19:38:04] !log depool maps-test2004 [19:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:38:46] greg-g: I guess you forgot to update the topic?^^ [19:40:01] !log disabled puppet on snapshot1001,2,4 while new hosts come on line, til probably Apr 5-6 [19:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:40:15] Luke081515: yeah, oops [19:40:24] apergos: ugh why [19:40:39] (and I went away to lunch ;) ) [19:41:04] (03PS2) 10Krinkle: webperf: Rename navtiming 'loading' and 'sending' to standard equivalent [puppet] - 10https://gerrit.wikimedia.org/r/281082 [19:41:06] (03PS1) 10Faidon Liambotis: netops: monitor all asw/msw/psw as well [puppet] - 10https://gerrit.wikimedia.org/r/281495 (https://phabricator.wikimedia.org/T83992) [19:41:19] paravoid: trying to refactor without worrying about the old hosts, which will be removed from puppet right after that new ones are running [19:41:27] (decommissioned) [19:41:37] it's only til tomorrow [19:41:37] well then remove the includes from puppet or something [19:42:54] I don't actually want it (yet) to remove anything just in case I get stuck; this leaves me room to back out (and yes, sometimes commenting out or removing a stanza means things get removed, annoyingly enough) [19:42:55] (03CR) 10Faidon Liambotis: [C: 032] netops: monitor all asw/msw/psw as well [puppet] - 10https://gerrit.wikimedia.org/r/281495 (https://phabricator.wikimedia.org/T83992) (owner: 10Faidon Liambotis) [19:43:25] then leave it as-is and don't merge your refactors [19:43:35] this is really not the right way to refactor [19:44:54] again, it's only for a day. 
not even that [19:45:40] I just know I cannot get it done tonight, without it taking a lot longer than if I wait til tomorrow am [19:50:32] (03PS3) 10Krinkle: webperf: Rename navtiming 'loading' and 'sending' to standard equivalent [puppet] - 10https://gerrit.wikimedia.org/r/281082 [19:50:34] (03PS1) 10Krinkle: webperf: Convert navtiming metric mapping into list [puppet] - 10https://gerrit.wikimedia.org/r/281497 [19:50:36] (03PS1) 10Krinkle: webperf: Collect metrics for 'domInteractive' and 'domComplete' [puppet] - 10https://gerrit.wikimedia.org/r/281498 [19:50:39] (03PS1) 10BBlack: VCL: remove all non-default between_bytes_timeout [puppet] - 10https://gerrit.wikimedia.org/r/281499 (https://phabricator.wikimedia.org/T131761) [19:51:49] (03PS2) 10BBlack: VCL: remove all non-default between_bytes_timeout [puppet] - 10https://gerrit.wikimedia.org/r/281499 (https://phabricator.wikimedia.org/T131761) [19:55:00] (03PS1) 10Krinkle: coal-web: Show domInteractive instead of domComplete [puppet] - 10https://gerrit.wikimedia.org/r/281501 [20:00:05] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160404T2000). 
[20:00:24] no mobileapps deployment today [20:00:26] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [20:04:25] PROBLEM - puppet last run on maps-test2004 is CRITICAL: CRITICAL: Puppet has 3 failures [20:04:49] !log starting parsoid deploy [20:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:07:31] !log synced code; restarted parsoid on wtp1002 as a canary [20:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:07:57] ACKNOWLEDGEMENT - puppet last run on maps-test2004 is CRITICAL: CRITICAL: Puppet has 3 failures alexandros kosiaris masked nodejs services [20:12:23] !log finished deploying parsoid sha 579ec3e6 [20:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:12:47] gwicke, subbu, is labs instance 'appservice' still in use? And, if so, can it survive a few minutes of downtime? [20:13:14] * subbu doesn't know what it is used for [20:13:35] andrewbogott: it is in use, yes [20:13:55] gwicke: and, downtime ok? [20:13:58] it is exercised as part of RB integration tests on each commit [20:14:22] a few minutes of downtime should be okay [20:14:32] we can always re-run failed tests [20:15:44] gwicke: ok, thank you, I'm going to migrate it elsewhere [20:16:05] (03PS1) 10Faidon Liambotis: monitoring: add "switches" hostgroup [puppet] - 10https://gerrit.wikimedia.org/r/281529 [20:16:11] andrewbogott: thank you! 
[20:16:34] (03CR) 10Faidon Liambotis: [C: 032 V: 032] monitoring: add "switches" hostgroup [puppet] - 10https://gerrit.wikimedia.org/r/281529 (owner: 10Faidon Liambotis) [20:18:36] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: puppet fail [20:21:21] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [20:21:43] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:24:52] PROBLEM - Juniper alarms on asw-c-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 1 yellow alarms [20:25:06] 6Operations, 10Monitoring, 10netops, 13Patch-For-Review: Juniper monitoring - https://phabricator.wikimedia.org/T83992#2178874 (10faidon) [20:38:08] (03PS1) 10ArielGlenn: [WIP] basic snapshot role for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281530 [20:38:38] (03PS6) 10Gehel: Make three of the newer ES nodes master eligable [puppet] - 10https://gerrit.wikimedia.org/r/251024 (https://phabricator.wikimedia.org/T112556) (owner: 10EBernhardson) [20:39:14] (03CR) 10Gehel: "Added "20" to the commit message as it was also added as a potential master." [puppet] - 10https://gerrit.wikimedia.org/r/251024 (https://phabricator.wikimedia.org/T112556) (owner: 10EBernhardson) [20:39:30] (03CR) 10jenkins-bot: [V: 04-1] [WIP] basic snapshot role for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281530 (owner: 10ArielGlenn) [20:44:31] (03PS2) 10ArielGlenn: [WIP] basic snapshot role for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281530 [20:52:30] 6Operations, 10Traffic, 10Wikimedia-Shop, 7HTTPS: shop switches HTTPS -> HTTP when showing login prompt (on clicking checkout) - https://phabricator.wikimedia.org/T63528#2179008 (10GHoltman) 5Open>3Resolved a:3GHoltman Resolved per HuiZSF [20:59:45] do we have canary servers for mediawiki ? I know about the test servers (mw1017, etc), but do we have canaries with real user traffic ? 
[21:00:30] (03PS3) 10ArielGlenn: basic snapshot role for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281530 [21:01:06] (03PS1) 10Andrew Bogott: Increase Horizon timeout to 24 hours. [puppet] - 10https://gerrit.wikimedia.org/r/281531 (https://phabricator.wikimedia.org/T130621) [21:01:08] gehel: I can see some mw with value canary and single_canary if you grep on puppet [21:01:49] (03PS2) 10Andrew Bogott: Increase Horizon timeout to 24 hours. [puppet] - 10https://gerrit.wikimedia.org/r/281531 (https://phabricator.wikimedia.org/T130621) [21:02:01] but I don't know how they are treated [21:02:12] (03PS3) 10Andrew Bogott: Increase Horizon timeout to 24 hours. [puppet] - 10https://gerrit.wikimedia.org/r/281531 (https://phabricator.wikimedia.org/T130621) [21:02:19] (03PS4) 10ArielGlenn: basic snapshot role for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281530 [21:03:56] (03CR) 10Andrew Bogott: [C: 032] Increase Horizon timeout to 24 hours. [puppet] - 10https://gerrit.wikimedia.org/r/281531 (https://phabricator.wikimedia.org/T130621) (owner: 10Andrew Bogott) [21:08:54] volans: If I read this correctly, the "canary" mw servers in puppet code is just a way to mark them for salt. Which probably indicates that I could use them as early deploys for CirrusSearch over HTTPS... [21:10:06] looks like this, but I didn't check if then that tag is used somewhere else [21:10:23] (03PS5) 10ArielGlenn: basic snapshot role for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281530 [21:10:50] time to go to sleep. I'll check more in depth tomorrow... 
[21:11:35] (03CR) 10ArielGlenn: [C: 032] basic snapshot role for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281530 (owner: 10ArielGlenn) [21:16:42] PROBLEM - puppet last run on mw2116 is CRITICAL: CRITICAL: Puppet has 1 failures [21:28:08] (03PS1) 10ArielGlenn: dumps: include the decl of repodir in the class using it [puppet] - 10https://gerrit.wikimedia.org/r/281533 [21:29:49] (03CR) 10ArielGlenn: [C: 032] dumps: include the decl of repodir in the class using it [puppet] - 10https://gerrit.wikimedia.org/r/281533 (owner: 10ArielGlenn) [21:31:42] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 1 failures [21:35:03] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet has 1 failures [21:42:32] (03PS1) 10ArielGlenn: explicitly define repodir on snpashots for cron run [puppet] - 10https://gerrit.wikimedia.org/r/281536 [21:44:11] RECOVERY - puppet last run on mw2116 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:45:48] (03CR) 10ArielGlenn: [C: 032] explicitly define repodir on snpashots for cron run [puppet] - 10https://gerrit.wikimedia.org/r/281536 (owner: 10ArielGlenn) [21:47:51] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [21:50:52] PROBLEM - puppet last run on mw2049 is CRITICAL: CRITICAL: puppet fail [21:59:01] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:04:23] 6Operations, 10ops-eqiad: db1052 degraded RAID - https://phabricator.wikimedia.org/T131701#2179269 (10Volans) ``` $ sudo megacli -PDRbld -ShowProg -PhysDrv [32:8] -aALL Rebuild Progress on Device at Enclosure 32, Slot 8 Completed 47% in 237 Minutes. ``` All looks good on out monitoring metrics and on the host. 
[22:04:53] (03PS1) 10ArielGlenn: commit the change of the var lookup for the repodir [puppet] - 10https://gerrit.wikimedia.org/r/281541 [22:06:28] (03CR) 10ArielGlenn: [C: 032] commit the change of the var lookup for the repodir [puppet] - 10https://gerrit.wikimedia.org/r/281541 (owner: 10ArielGlenn) [22:18:22] RECOVERY - puppet last run on mw2049 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [22:23:17] (03PS1) 10ArielGlenn: minor restructure of hiera data for snapshots [puppet] - 10https://gerrit.wikimedia.org/r/281544 [22:24:39] (03CR) 10ArielGlenn: [C: 032] minor restructure of hiera data for snapshots [puppet] - 10https://gerrit.wikimedia.org/r/281544 (owner: 10ArielGlenn) [22:36:52] (03PS1) 10ArielGlenn: enabling hhvm for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281547 (https://phabricator.wikimedia.org/T94277) [22:38:08] (03CR) 10ArielGlenn: [C: 032] enabling hhvm for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281547 (https://phabricator.wikimedia.org/T94277) (owner: 10ArielGlenn) [22:41:56] (03CR) 10Dereckson: [C: 031] Enable Translate extension on uawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm) [22:45:39] jouncebot next [22:45:39] In 0 hour(s) and 14 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160404T2300) [22:47:23] ACKNOWLEDGEMENT - Juniper alarms on asw-c-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms Faidon Liambotis 2½ year old alarm. Will be fixed on the next JunOS upgrade. 
[22:48:17] (03PS2) 10Faidon Liambotis: hhvm: use random tmpdir in hhvm-collect-heaps [puppet] - 10https://gerrit.wikimedia.org/r/263829 [22:48:41] (03CR) 10Faidon Liambotis: [C: 032 V: 032] hhvm: use random tmpdir in hhvm-collect-heaps [puppet] - 10https://gerrit.wikimedia.org/r/263829 (owner: 10Faidon Liambotis) [22:48:56] apergos: you have unmerged changes [22:49:00] ok to merge? [22:49:06] crap [22:49:10] yes please [22:49:18] thanks [22:49:20] paravoid: [22:49:45] (done) [22:50:46] (03PS1) 10ArielGlenn: add mediawiki packages and other dependencies for snapshots [puppet] - 10https://gerrit.wikimedia.org/r/281552 [22:50:57] thanks! [22:52:06] (03CR) 10Faidon Liambotis: [C: 032] Remove scs-oe11-esams DNS [dns] - 10https://gerrit.wikimedia.org/r/281116 (owner: 10Faidon Liambotis) [22:52:58] paravoid: Is stashbot still banned? [22:53:02] PROBLEM - puppet last run on mw1104 is CRITICAL: CRITICAL: Puppet has 1 failures [22:54:51] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [22:55:04] Hello. [22:55:23] Luke081515: I don't think it is banned, it just isn't working [22:55:31] it was banned by me [22:55:39] I turned it off [22:56:18] in case you're curious: https://github.com/bd808/tools-stashbot/issues/9 [22:57:50] Krenair or MaxSem > can we work together on the SWAT? [22:57:53] (03PS2) 10ArielGlenn: add mediawiki packages and other dependencies for snapshots [puppet] - 10https://gerrit.wikimedia.org/r/281552 [22:58:07] Dereckson, sure [22:58:32] how do you want to communicate? 
[22:58:52] hi [22:58:57] (03CR) 10ArielGlenn: [C: 032] add mediawiki packages and other dependencies for snapshots [puppet] - 10https://gerrit.wikimedia.org/r/281552 (owner: 10ArielGlenn) [22:59:52] (03PS3) 10Faidon Liambotis: resolving::domain_search: drop esams.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/280503 (https://phabricator.wikimedia.org/T123712) (owner: 10Dzahn) [23:00:04] Dereckson wants to deploy the patch for his homewiki? ;) [23:00:04] RoanKattouw ostriches Krenair MaxSem Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160404T2300). Please do the needful. [23:00:04] Luke081515: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:09] How can we do that? Do you know a quick software to share a screen? Or I can also create a shared@ account on my server, from there we can share a tmux, and I su under my account. [23:00:13] * Luke081515 is already here [23:00:17] (03PS4) 10Faidon Liambotis: resolving::domain_search: drop esams.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/280503 (https://phabricator.wikimedia.org/T123712) (owner: 10Dzahn) [23:00:21] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [23:00:41] Dereckson, you haven't deployed by yourself yet? [23:00:59] (03CR) 10Faidon Liambotis: [C: 032] resolving::domain_search: drop esams.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/280503 (https://phabricator.wikimedia.org/T123712) (owner: 10Dzahn) [23:01:05] Er yes, one config patch Saturday morning. [23:01:15] and the logo for cs. [23:01:30] what do you mean saturday morning? [23:01:49] stuff happened on saturday? huh, ok [23:01:56] 08:29 logmsgbot: dereckson@tin Synchronized wmf-config/throttle.php: Fix throttle rules (Gerrit change 280819). 
(duration: 00m 29s) [23:02:16] the usual ip / IP typo in throttle.php to fix. [23:02:34] srsly, dude [23:02:38] In theory I got three changes, so everyone can deploy :D [23:02:46] on Saturday, your first deploy [23:03:33] Dereckson, anyway, do you have any questions about deploying? [23:03:45] wait, Dereckson why did you deploy on Saturday? [23:04:13] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: puppet fail [23:04:33] greg-g: reedy added a throttle rule for the hackathon [23:04:55] he copied/pasted a block I wrote for another rule [23:05:06] both blocks used "ip" in lowercase instead of "IP" in uppercase [23:05:07] do you even have access to the contact list in case you broke prod? [23:05:10] which turns out to have been broken [23:05:19] MaxSem, you mean the staff contact list? [23:05:26] yep [23:05:37] Dereckson doesn't have an account on that wiki if that's what you're asking [23:05:47] like, people who can fix cluster if it goes boom [23:06:04] Reedy was on the channel with me at that moment. [23:06:54] maybe you can discuss that after the SWAT? The swat actually blocks my work on my bot [23:07:05] I'm a little sad that your first deploy happened on a saturday, even with Reedy's prodding [23:08:22] (03PS2) 10Faidon Liambotis: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) [23:08:30] ok, so, I'll talk with Dereckson and Reedy tomorrow or so about this, for now just get the swat done [23:08:39] ok [23:08:41] Dereckson is there a problem with this swat? [23:08:49] I'll do it meanwhile [23:08:57] Dereckson: for the record, please only deploy during swat windows [23:09:01] No, it seems to be three changes to merge with only sync-file. 
[23:09:04] greg-g: okay [23:09:42] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0] [23:10:57] (03CR) 10MaxSem: [C: 032] Two permission changes at cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281314 (https://phabricator.wikimedia.org/T131684) (owner: 10Luke081515) [23:11:12] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0] [23:11:33] (03Merged) 10jenkins-bot: Two permission changes at cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281314 (https://phabricator.wikimedia.org/T131684) (owner: 10Luke081515) [23:13:45] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/281314/ (duration: 00m 28s) [23:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:13:51] Luke081515, ^ [23:13:55] (03PS1) 10ArielGlenn: fix one more stray directory reference in snapshot misc cron job [puppet] - 10https://gerrit.wikimedia.org/r/281560 [23:14:01] PROBLEM - puppet last run on mw2135 is CRITICAL: CRITICAL: Puppet has 1 failures [23:14:28] change broke nothing, but one problem with this change: https://cs.wikipedia.org/wiki/Speci%C3%A1ln%C3%AD:Seznam_u%C5%BEivatelsk%C3%BDch_pr%C3%A1v [23:14:39] created one additional group instead of adding one permission [23:14:41] greg-g: MaxSem: by the way, erratum, it was Friday morning, the first day of the hackathon [23:14:44] I will write a fix [23:15:31] (03CR) 10MaxSem: [C: 032] Update project logo for ast.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280445 (https://phabricator.wikimedia.org/T131247) (owner: 10MarcoAurelio) [23:16:00] (03PS1) 10Faidon Liambotis: otrs: remove monitoring::service['https'] [puppet] - 10https://gerrit.wikimedia.org/r/281561 [23:16:19] (03Merged) 10jenkins-bot: Update project logo for ast.wikipedia [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/280445 (https://phabricator.wikimedia.org/T131247) (owner: 10MarcoAurelio) [23:16:29] (03CR) 10Faidon Liambotis: [C: 032 V: 032] otrs: remove monitoring::service['https'] [puppet] - 10https://gerrit.wikimedia.org/r/281561 (owner: 10Faidon Liambotis) [23:16:31] RECOVERY - puppet last run on mw1104 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [23:16:32] Dereckson: Do you have the link where the name of the course coordinator is listed again? I need to fix the name [23:17:29] !log maxsem@tin Synchronized static/images/project-logos/astwiki.png: https://gerrit.wikimedia.org/r/#/c/280445/ (duration: 00m 27s) [23:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:52] (03PS2) 10Faidon Liambotis: otrs: remove monitoring::service['https'] [puppet] - 10https://gerrit.wikimedia.org/r/281561 [23:17:52] ok, this patch works [23:18:18] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/280445/ (duration: 00m 27s) [23:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:18:36] (03CR) 10Faidon Liambotis: [V: 032] otrs: remove monitoring::service['https'] [puppet] - 10https://gerrit.wikimedia.org/r/281561 (owner: 10Faidon Liambotis) [23:19:10] now about the frwiki patch... do we know how to test it? [23:19:23] Luke081515: https://www.mediawiki.org/wiki/Extension:Education_Program/Preferences#Course_coordinators [23:19:26] we can take a look if the protection level is there [23:19:36] and I'm in #wikipedia-fr, can ask a sysop there [23:19:40] thx, dereckson [23:20:44] Dereckson: There is something wrong on that mw.org page: My patch has the exact user group name as there [23:20:57] 6Operations, 6Labs, 10Tool-Labs, 7Icinga: tool labs instance distribution monitoring is broken - https://phabricator.wikimedia.org/T119929#1840320 (10faidon) Ping? 
If this can't be fixed anytime soon, can we remove the check from the servers on puppet at least? (I've been auditing acknowledged-but-forgotte... [23:20:58] but the patch created an additional group [23:21:52] found the right name in the source code [23:22:11] ok... [23:22:38] mw.org has ep.coordinator, while the program uses epcoordinator [23:22:44] *ep-coordinator [23:22:54] (03CR) 10MaxSem: [C: 032] Add 'editextendedsemiprotected' protection level on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281070 (https://phabricator.wikimedia.org/T131109) (owner: 10Luke081515) [23:24:10] (03Merged) 10jenkins-bot: Add 'editextendedsemiprotected' protection level on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281070 (https://phabricator.wikimedia.org/T131109) (owner: 10Luke081515) [23:24:22] PROBLEM - HTTPS on mendelevium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: SSL connect attempt failed with unknown error error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol [23:25:12] !log maxsem@tin Synchronized wmf-config/: https://gerrit.wikimedia.org/r/#/c/281070/ (duration: 00m 32s) [23:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:25:56] (ignore the mendelevium alert, see my patchset above) [23:26:00] (not that anyone is looking :P) [23:26:26] ok, all looks as expected, I'll ask at the frwp channel to try it out [23:26:59] (03PS1) 10Luke081515: fix epcoordinator name at cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281565 (https://phabricator.wikimedia.org/T131684) [23:27:14] maxsem: ^ [23:28:26] (03CR) 10MaxSem: [C: 032] fix epcoordinator name at cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281565 (https://phabricator.wikimedia.org/T131684) (owner: 10Luke081515) [23:28:52] (03Merged) 10jenkins-bot: fix epcoordinator name at cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281565 
(https://phabricator.wikimedia.org/T131684) (owner: 10Luke081515) [23:29:45] maxsem, dereckson: Ok, patch for frwp works: https://fr.wikipedia.org/w/index.php?title=Utilisateur%3AAsh_Crow%2FBrouillon%2FFirefly&type=revision&diff=125012838&oldid=105231109 [23:30:06] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/281565/ (duration: 00m 29s) [23:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:30:13] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: puppet fail [23:30:38] maxsem: Got it, works now as expected [23:30:41] thanks for SWAT [23:30:44] awsum [23:31:26] 6 out of 19 alerts are yours now :P [23:33:08] MaxSem: so where can I get a copy of this contact list? [23:34:03] (03PS1) 10ArielGlenn: remove nutcracker for now, broken manifest on jessie [puppet] - 10https://gerrit.wikimedia.org/r/281569 [23:34:18] Dereckson, you can't without diving into officewiki from the server side [23:34:30] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: puppet fail [23:35:14] (03PS2) 10ArielGlenn: remove nutcracker for now, broken manifest on jessie [puppet] - 10https://gerrit.wikimedia.org/r/281569 [23:37:05] (03CR) 10ArielGlenn: [C: 032] remove nutcracker for now, broken manifest on jessie [puppet] - 10https://gerrit.wikimedia.org/r/281569 (owner: 10ArielGlenn) [23:40:28] Dereckson: I fixed the wrong information on mw.org about the education program group names [23:40:31] RECOVERY - puppet last run on mw2135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:44:02] Luke081515: perfect [23:54:55] ACKNOWLEDGEMENT - puppet last run on snapshot1005 is CRITICAL: CRITICAL: puppet fail arielglenn packages missing on jessie (of course). [23:54:55] ACKNOWLEDGEMENT - puppet last run on snapshot1006 is CRITICAL: CRITICAL: puppet fail arielglenn packages missing on jessie (of course). 
[23:54:55] ACKNOWLEDGEMENT - puppet last run on snapshot1007 is CRITICAL: CRITICAL: puppet fail arielglenn packages missing on jessie (of course). [23:56:02] PROBLEM - Apache HTTP on mw1240 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.009 second response time [23:57:51] RECOVERY - Apache HTTP on mw1240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.051 second response time [23:58:31] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]