[00:00:01] (PS2) Ori.livneh: Sync ps_mem.py from origin [puppet] - https://gerrit.wikimedia.org/r/264028 (owner: John Vandenberg)
[00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160115T0000).
[00:00:04] IoannisKydonis andre__: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:47] (PS3) Ori.livneh: Sync ps_mem.py from origin [puppet] - https://gerrit.wikimedia.org/r/264028 (owner: John Vandenberg)
[00:02:35] (CR) Ori.livneh: [C: 2 V: 2] Sync ps_mem.py from origin [puppet] - https://gerrit.wikimedia.org/r/264028 (owner: John Vandenberg)
[00:06:56] !log restbase started a dump of enwiki to populate storage with mobileapps renders
[00:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:12:09] (CR) Mobrovac: [C: -1] Add parsoid::testing role and use it on ruthenium (2 comments) [puppet] - https://gerrit.wikimedia.org/r/264024 (owner: Subramanya Sastry)
[00:14:13] hm
[00:14:19] I suppose I should swat then
[00:14:46] even though no one is here
[00:15:17] tgr, around?
[00:15:26] Krenair: o/
[00:15:41] so we have a request to deploy https://gerrit.wikimedia.org/r/#/c/264091/
[00:16:02] yeah, it's a GCI task
[00:17:06] is that all okay with you, and can you check everything is working as expected after deployment?
[00:18:09] yes
[00:18:37] (PS9) Alex Monk: Update MediaViewer configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/264091 (owner: IoannisKydonis)
[00:19:03] (CR) Alex Monk: [C: 2] Update MediaViewer configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/264091 (owner: IoannisKydonis)
[00:19:37] (Merged) jenkins-bot: Update MediaViewer configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/264091 (owner: IoannisKydonis)
[00:21:27] (PS1) Andrew Bogott: Remove use of passwords::openstack::nova::nova_ldap_user_pass [puppet] - https://gerrit.wikimedia.org/r/264217
[00:22:00] !log krenair@tin Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/264091/ (duration: 00m 32s)
[00:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:22:08] tgr, ^
[00:23:14] (PS5) Alex Monk: Lift IP rate limit [mediawiki-config] - https://gerrit.wikimedia.org/r/263905 (https://phabricator.wikimedia.org/T123458) (owner: Samtar)
[00:24:00] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: touch (duration: 00m 31s)
[00:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:24:30] tgr, all ok?
[00:24:38] (PS5) Subramanya Sastry: Add parsoid::testing role and use it on ruthenium [puppet] - https://gerrit.wikimedia.org/r/264024
[00:24:40] (PS3) Subramanya Sastry: WIP: Add the visualdiff module; instantiate psd visualdiffing services [puppet] - https://gerrit.wikimedia.org/r/264032
[00:24:55] no effect so far, but it's a ResourceLoader change, those tend to take a few minutes
[00:25:27] (CR) Mobrovac: WIP: Add the visualdiff module; instantiate psd visualdiffing services (3 comments) [puppet] - https://gerrit.wikimedia.org/r/264032 (owner: Subramanya Sastry)
[00:26:09] (CR) jenkins-bot: [V: -1] Add parsoid::testing role and use it on ruthenium [puppet] - https://gerrit.wikimedia.org/r/264024 (owner: Subramanya Sastry)
[00:26:11] (CR) Alex Monk: [C: 2] Lift IP rate limit [mediawiki-config] - https://gerrit.wikimedia.org/r/263905 (https://phabricator.wikimedia.org/T123458) (owner: Samtar)
[00:26:30] (CR) jenkins-bot: [V: -1] WIP: Add the visualdiff module; instantiate psd visualdiffing services [puppet] - https://gerrit.wikimedia.org/r/264032 (owner: Subramanya Sastry)
[00:26:48] (Merged) jenkins-bot: Lift IP rate limit [mediawiki-config] - https://gerrit.wikimedia.org/r/263905 (https://phabricator.wikimedia.org/T123458) (owner: Samtar)
[00:27:03] (PS1) Cenarium: Autopromotion: remove deprecated onView event, fix INGROUPS [mediawiki-config] - https://gerrit.wikimedia.org/r/264219
[00:27:36] !log krenair@tin Synchronized wmf-config/throttle.php: https://gerrit.wikimedia.org/r/#/c/263905/ (duration: 00m 32s)
[00:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:29:51] (CR) Mobrovac: [C: -1] Add parsoid::testing role and use it on ruthenium (4 comments) [puppet] - https://gerrit.wikimedia.org/r/264024 (owner: Subramanya Sastry)
[00:30:54] PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:31:25]
PROBLEM - Apache HTTP on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:32:26] Krenair: all good, thanks
[00:32:52] actually, http://en.wikipedia.beta.wmflabs.org/wiki/Special:ListFiles is broken, but that's unrelated
[00:35:35] filed as https://phabricator.wikimedia.org/T123695
[00:46:33] (PS2) Andrew Bogott: Remove use of passwords::openstack::nova::nova_ldap_user_pass [puppet] - https://gerrit.wikimedia.org/r/264217
[00:48:02] might do a backport of a core change to fix an error occurring in prod
[00:49:29] operations, Mail: delete exim alias vpe-staff: eng-mgt - https://phabricator.wikimedia.org/T123667#1935936 (JKrauska) Spoke to praveena who originally requested this -- safe to remove...
[00:49:52] operations, Mail: delete exim alias vpe-staff: eng-mgt - https://phabricator.wikimedia.org/T123667#1935938 (JKrauska) a:JKrauska>Dzahn
[00:57:07] (PS3) Andrew Bogott: Remove use of passwords::openstack::nova::nova_ldap_user_pass [puppet] - https://gerrit.wikimedia.org/r/264217
[01:00:36] !log krenair@tin Synchronized php-1.27.0-wmf.10/includes/api/ApiQueryWatchlist.php: https://gerrit.wikimedia.org/r/#/c/264224/ (duration: 00m 31s)
[01:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:00:57] hmm
[01:01:00] I don't think that fully fixed it
[01:01:11] pretty sure there's still the RC one causing issues
[01:06:35] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[01:07:54] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[01:10:29] (PS1) Cmjohnson: Adding piwik-roots group bug: task# T122325 [puppet] - https://gerrit.wikimedia.org/r/264229
[01:11:15] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[01:11:37] (CR) jenkins-bot: [V: -1] Adding
piwik-roots group bug: task# T122325 [puppet] - https://gerrit.wikimedia.org/r/264229 (owner: Cmjohnson)
[01:12:45] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:13:15] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:14:05] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:25:31] (PS3) Mattflaschen: Exclude fishbowl and add computed dblist for Flow [mediawiki-config] - https://gerrit.wikimedia.org/r/250460
[01:38:35] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: puppet fail
[01:55:07] (PS6) Subramanya Sastry: Add parsoid::testing role and use it on ruthenium [puppet] - https://gerrit.wikimedia.org/r/264024
[01:55:10] (PS4) Subramanya Sastry: WIP: Add the visualdiff module; instantiate psd visualdiffing services [puppet] - https://gerrit.wikimedia.org/r/264032
[02:00:55] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[02:01:55] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [1000.0]
[02:05:14] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:05:55] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:06:05] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:09:45] PROBLEM - puppet last run on bromine is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:23:43] !log pull annualreport git repo on bromine for Krenair
[02:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:26:02] thanks YuviPanda, also can you add that puppet swat window please?
[02:26:12] oh yeah
[02:26:14] sure
[02:26:20] or I suppose I can try to guess all the details
[02:26:23] based on previous ones
[02:26:25] PROBLEM - puppet last run on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:26:27] thanks for reminding me
[02:26:35] Krenair: that's what I'm going to do :D
[02:26:40] Krenair: since I've no real memory anyway
[02:27:00] Krenair: it's the same time as thursday, same people
[02:28:56] PROBLEM - HHVM rendering on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:29:02] tables
[02:29:04] how do they work
[02:29:25] PROBLEM - Apache HTTP on mw1126 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.039 second response time
[02:29:55] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 14m 00s)
[02:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:30:48] !log krenair@tin Synchronized php-1.27.0-wmf.10/includes/api/ApiQueryRecentChanges.php: https://gerrit.wikimedia.org/r/264231 (duration: 00m 42s)
[02:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:30:58] That seems to have stopped the rc_log_type errors
[02:31:05] RECOVERY - HHVM rendering on mw1126 is OK: HTTP OK: HTTP/1.1 200 OK - 69502 bytes in 2.881 second response time
[02:31:44] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.051 second response time
[02:33:34] Krenair: I think that's right
[02:33:50] I just copied the people who are in for thursday
[02:34:26] Well.
[02:34:28] https://office.wikimedia.org/wiki/Operations/Operations_Meeting_Notes/TechOps-2016-01-13#PuppetSWAT
[02:35:10] yeah ok
[02:35:13] moritzm and mutante did it today/yesterday
[02:35:14] so the deployments page is wrong
[02:39:38] (CR) Krinkle: [WIP] Implement /w/static.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: Krinkle)
[02:41:44] I don't know how people survive wikitext
[02:41:46] what a clusterfuck
[02:57:25] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[02:58:44] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[03:00:04] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[03:02:14] PROBLEM - puppet last run on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:02:15] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[03:03:04] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[03:03:22] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 16m 02s)
[03:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:03:45] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[03:10:10] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Jan 15 03:10:09 UTC 2016 (duration 6m 48s)
[03:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:21:35] PROBLEM - puppet last run on mw2055 is CRITICAL: CRITICAL: puppet fail
[03:39:21] (PS2) Cmjohnson: Adding piwik-roots group bug: task# T122325 [puppet] - https://gerrit.wikimedia.org/r/264229
[03:40:34] (CR) jenkins-bot: [V: -1] Adding piwik-roots group bug: task# T122325 [puppet] - https://gerrit.wikimedia.org/r/264229 (owner: Cmjohnson)
[03:48:45] RECOVERY - puppet last run on mw2055 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[04:03:14] PROBLEM - puppet last run on mw2088 is CRITICAL: CRITICAL: puppet fail
[04:30:24] RECOVERY - puppet last run on mw2088 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[04:38:24] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[04:43:04] operations, Mail: remove or update ea@ alias? - https://phabricator.wikimedia.org/T123286#1936163 (JKrauska) ok to remove -- now rebuilt as a google group.
[04:43:15] operations, Mail: remove or update ea@ alias? - https://phabricator.wikimedia.org/T123286#1936164 (JKrauska) a:JKrauska>Dzahn
[04:43:23] operations, Mail: remove ea@ alias?
- https://phabricator.wikimedia.org/T123286#1936165 (JKrauska)
[05:16:14] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [1000.0]
[05:17:25] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:18:15] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:19:34] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:39:31] (PS1) Andrew Bogott: Install nova-network and nova-api on labtestnet2001 [puppet] - https://gerrit.wikimedia.org/r/264256
[05:41:36] (CR) Andrew Bogott: [C: 2] Install nova-network and nova-api on labtestnet2001 [puppet] - https://gerrit.wikimedia.org/r/264256 (owner: Andrew Bogott)
[05:44:07] (CR) Tim Starling: [C: -1] [WIP] Implement /w/static.php (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: Krinkle)
[06:31:01] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Puppet has 5 failures
[06:31:02] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:02] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:02] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:42] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:53] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:43] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:33] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:02] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:32] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL:
Puppet has 1 failures
[06:36:11] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:22] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:53:42] PROBLEM - puppet last run on bromine is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:55:23] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:55:41] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:55:42] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:55:42] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:56:02] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:56:22] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:56:41] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:56:41] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:11] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:41] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:51] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 24 minutes ago with 0 failures
[06:57:52] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:01] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:07:41] operations,
Mail: remove shop@wikipedia.org -> board@wikipedia.org - https://phabricator.wikimedia.org/T123672#1936232 (faidon) Probably a remnant from the days when there was little to no staff. Delete.
[07:08:43] operations, Mail: remove/update office@wikipedia.org alias - https://phabricator.wikimedia.org/T123669#1936239 (faidon) ``` root@mx1001:~# exim4 -bt dphelps@wikimedia.org dphelps@wikimedia.org is undeliverable: Address dphelps@wikimedia.org does not exist root@mx1001:~# exim4 -bt dphelps@wikipedia.org dphe...
[07:09:47] operations, Mail: remove staff@wikipedia.org - https://phabricator.wikimedia.org/T123670#1936241 (faidon) a:JKrauska>Dzahn Well we (ops) receive noc@, and I've never seen an email to staff@ in that mailbox during my tenure here, nor do I think we should be the ones receiving something addressed to "s...
[07:10:31] PROBLEM - puppet last run on bromine is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:13:51] operations, ContentTranslation-cxserver, Services, Traffic: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1936246 (santhosh) Note that cxserver uses RESTBase and does not have any references to parsoid
[08:16:21] (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/264125 (owner: Alex Monk)
[08:34:07] (PS1) Catrope: Add wgScriptPath to InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/264260
[08:40:22] PROBLEM - puppet last run on mw2060 is CRITICAL: CRITICAL: puppet fail
[08:44:38] (PS2) Catrope: Add wgScriptPath to InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/264260
[08:44:40] (PS1) Catrope: Add wgScriptPath to InitialiseSettings in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/264261
[08:45:30] (CR) Catrope: [C: 2] Add wgScriptPath to InitialiseSettings in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/264261 (owner: Catrope)
[08:45:57]
(Merged) jenkins-bot: Add wgScriptPath to InitialiseSettings in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/264261 (owner: Catrope)
[08:47:42] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /a 328434 MB (3% inode=99%)
[08:49:05] (CR) Alexandros Kosiaris: [C: -2] "niah, we will just use alsafi's public IP. I am already migrating role::url_downloader to allow for that in https://gerrit.wikimedia.org/r" [dns] - https://gerrit.wikimedia.org/r/264208 (https://phabricator.wikimedia.org/T122134) (owner: Dzahn)
[08:52:34] (CR) Alexandros Kosiaris: [C: -1] "let's just use alsafi's IP. It will never be used for anything else anyway. I am already working on a patch to allow that to happen in rol" [puppet] - https://gerrit.wikimedia.org/r/264205 (https://phabricator.wikimedia.org/T122134) (owner: Dzahn)
[09:00:12] RECOVERY - Disk space on stat1002 is OK: DISK OK
[09:06:22] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
[09:06:52] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
[09:07:02] PROBLEM - Static Bugzilla HTTP on bromine is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:07:31] RECOVERY - puppet last run on mw2060 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[09:17:22] RECOVERY - Static Bugzilla HTTP on bromine is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 9.861 second response time
[09:17:59] (PS1) Muehlenhoff: Remove special case handling for labs realm [puppet] - https://gerrit.wikimedia.org/r/264264
[09:18:23] !log installed git security updates on all jessie systems
[09:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:41:11] PROBLEM - puppet last run on lvs2005 is CRITICAL: CRITICAL: puppet fail
[09:50:42] PROBLEM - Static Bugzilla HTTP on bromine is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:52:51] RECOVERY - Static Bugzilla HTTP on bromine is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 8.691 second response time
[10:01:32] !log installed ganeti security updates
[10:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:03:42] PROBLEM - Static Bugzilla HTTP on bromine is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:07:52] RECOVERY - Static Bugzilla HTTP on bromine is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 9.707 second response time
[10:08:42] RECOVERY - puppet last run on lvs2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:23:03] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: puppet fail
[10:27:11] operations: Reduce rpcbind use - https://phabricator.wikimedia.org/T106477#1936292 (MoritzMuehlenhoff) A similar issue exists with smbclient: nagios-plugins-standard recommends it, which in turn pulls in a lot of Samba-related libraries. Since we neither use CIFS shares nor any Windows systems, I doubt it is...
[10:35:35] operations, Patch-For-Review, Swift: swift upgrade plans - https://phabricator.wikimedia.org/T117972#1936296 (fgiunchedi) I did some `upgrade` + `dist-upgrade` testing in labs with `swift-upgrade-ms-fe01` and `swift-upgrade-ms-be01` to move precise -> trusty and it seems successful
[10:40:11] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 886
[10:45:11] RECOVERY - check_mysql on db1008 is OK: Uptime: 246351 Threads: 2 Questions: 1566646 Slow queries: 1596 Opens: 1101 Flush tables: 2 Open tables: 397 Queries per second avg: 6.359 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[10:48:14] operations, Patch-For-Review, Swift: swift upgrade plans - https://phabricator.wikimedia.org/T117972#1936338 (MoritzMuehlenhoff) Is there a specific feature from 2.5 that we need compared to the stock swift version in jessie? We could also migrate to standard jessie first (which is still a leap forwa...
[10:50:31] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:54:36] operations: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#1936342 (fgiunchedi) NEW
[11:02:11] operations: Reimage hooft with jessie and rename to bast3001 - https://phabricator.wikimedia.org/T123712#1936351 (MoritzMuehlenhoff) NEW
[11:02:43] operations: Reimage hooft with jessie and rename to bast3001 - https://phabricator.wikimedia.org/T123712#1936359 (MoritzMuehlenhoff)
[11:02:45] operations: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1936358 (MoritzMuehlenhoff)
[11:04:25] operations, Continuous-Integration-Infrastructure, Gerrit, GitHub-Mirrors, and 2 others: [Task] Redirect unused extensions/ValueView repository to data-values/value-view - https://phabricator.wikimedia.org/T123624#1936360 (thiemowmde)
https://wikitech.wikimedia.org/wiki/Gerrit#Deleting_repositorie...
[11:07:09] operations, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1936362 (hashar)
[11:07:12] PROBLEM - puppet last run on mw1013 is CRITICAL: CRITICAL: Puppet last ran 1 day ago
[11:07:33] <_joe_> !log re-enabled puppet on mw1013, restarted HHVM to make it pick up our latest changes
[11:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:07:59] operations, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1931694 (hashar)
[11:08:37] operations, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1931694 (hashar) Added T95757 as a blocker. `gallium;wikimedia.org` is Precise and hosts Zuul scheduler / Jenkins.
[11:11:22] RECOVERY - puppet last run on mw1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:27:17] operations: Reinstall magnesium with jessie - https://phabricator.wikimedia.org/T123713#1936373 (MoritzMuehlenhoff) NEW
[11:27:37] operations: Reinstall magnesium with jessie - https://phabricator.wikimedia.org/T123713#1936381 (MoritzMuehlenhoff)
[11:29:58] operations, Patch-For-Review, Swift: swift upgrade plans - https://phabricator.wikimedia.org/T117972#1936382 (fgiunchedi) no critical 2.5 feature afaict, mostly performance related. migrating all swift to trusty first has the added bonus of getting us in a uniform situation, codfw is trusty already and...
[11:32:39] operations: Reinstall caesium with jessie - https://phabricator.wikimedia.org/T123714#1936383 (MoritzMuehlenhoff) NEW
[11:34:52] operations, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1936390 (MoritzMuehlenhoff)
[11:34:53] operations: Reinstall caesium with jessie - https://phabricator.wikimedia.org/T123714#1936391 (MoritzMuehlenhoff)
[11:35:59] operations, Tracking: Reinstall nitrogen with jessie - https://phabricator.wikimedia.org/T123715#1936392 (fgiunchedi) NEW
[11:37:13] operations, Gerrit, GitHub-Mirrors, ValueView, Wikidata: [Bug] ValueView GitHub mirror not updated any more - https://phabricator.wikimedia.org/T123521#1936413 (thiemowmde) p:Normal>Unbreak! Latest commits and new tags we created still do not show up on the GitHub mirror.
[11:38:33] operations: Phase out antimony.wikimedia.org - https://phabricator.wikimedia.org/T123718#1936415 (MoritzMuehlenhoff) NEW
[11:38:58] operations: Phase out antimony.wikimedia.org - https://phabricator.wikimedia.org/T123718#1936422 (MoritzMuehlenhoff)
[11:39:36] operations: Phase out antimony.wikimedia.org - https://phabricator.wikimedia.org/T123718#1936415 (MoritzMuehlenhoff)
[11:39:37] operations, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1936424 (MoritzMuehlenhoff)
[11:40:08] operations, Tracking: Reinstall protactinium with jessie - https://phabricator.wikimedia.org/T123720#1936432 (fgiunchedi) NEW
[11:40:59] operations: Reinstall bast1001 with jessie - https://phabricator.wikimedia.org/T123721#1936438 (MoritzMuehlenhoff) NEW
[11:41:25] operations: Reinstall bast1001 with jessie - https://phabricator.wikimedia.org/T123721#1936446 (MoritzMuehlenhoff)
[11:41:26] operations, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1931694 (MoritzMuehlenhoff)
[11:45:19] operations, Tracking: Reinstall erbium with jessie - https://phabricator.wikimedia.org/T123722#1936447 (fgiunchedi) NEW
[11:46:35] operations, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1936458 (MoritzMuehlenhoff)
[11:46:36] operations: Reinstall bast1001 with jessie - https://phabricator.wikimedia.org/T123721#1936453 (MoritzMuehlenhoff) Open>Resolved a:MoritzMuehlenhoff Not needed, it's in the process of being decommissioned, see T123029
[11:47:05] operations, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1931694 (MoritzMuehlenhoff)
[11:47:06] operations, Tracking: Reinstall erbium with jessie - https://phabricator.wikimedia.org/T123722#1936460 (MoritzMuehlenhoff) Open>Resolved a:MoritzMuehlenhoff Not needed, it's in the process of being decommissioned, see T123029
[11:47:37] operations, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1931694 (MoritzMuehlenhoff)
[11:47:38] operations: Reinstall bast1001 with jessie - https://phabricator.wikimedia.org/T123721#1936466 (MoritzMuehlenhoff) Resolved>Open Argh, wrong tab, I meant to close a different bug. Reopen
[11:52:44] operations: Move bacula director and storage daemon off helium?
- https://phabricator.wikimedia.org/T123723#1936475 (MoritzMuehlenhoff) NEW
[11:55:42] operations, Dumps-Generation: Migrate dataset1001 and ms1001 to jessie - https://phabricator.wikimedia.org/T123724#1936489 (MoritzMuehlenhoff) NEW
[11:56:06] (PS1) Filippo Giunchedi: swift: reinstall ms-fe300* with trusty [puppet] - https://gerrit.wikimedia.org/r/264275 (https://phabricator.wikimedia.org/T117972)
[11:56:14] operations, Dumps-Generation: Migrate dataset1001 and ms1001 to jessie - https://phabricator.wikimedia.org/T123724#1936498 (MoritzMuehlenhoff)
[11:56:16] operations, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1936497 (MoritzMuehlenhoff)
[11:57:38] (CR) Filippo Giunchedi: [C: 2 V: 2] swift: reinstall ms-fe300* with trusty [puppet] - https://gerrit.wikimedia.org/r/264275 (https://phabricator.wikimedia.org/T117972) (owner: Filippo Giunchedi)
[11:58:12] operations, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1931694 (MoritzMuehlenhoff)
[11:58:15] operations, Beta-Cluster-Infrastructure, HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1936501 (MoritzMuehlenhoff)
[12:01:47] operations: Migrate titanium to jessie - https://phabricator.wikimedia.org/T123725#1936508 (MoritzMuehlenhoff)
[12:02:08] operations: Migrate titanium to jessie - https://phabricator.wikimedia.org/T123725#1936502 (MoritzMuehlenhoff)
[12:02:09] operations, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1936511 (MoritzMuehlenhoff)
[12:07:59] operations, Gerrit, GitHub-Mirrors, ValueView, Wikidata: [Bug] ValueView GitHub mirror not updated any more - https://phabricator.wikimedia.org/T123521#1936516 (thiemowmde) I'm waiting for ValueView 0.15.7 to be released for a month
now, see https://gerrit.wikimedia.org/r/259760. I tried to m...
[12:09:03] operations, Gerrit, GitHub-Mirrors, ValueView, and 2 others: [Bug] ValueView GitHub mirror not updated any more - https://phabricator.wikimedia.org/T123521#1936518 (thiemowmde)
[12:09:49] operations, Tracking: Reinstall protactinium with jessie - https://phabricator.wikimedia.org/T123720#1936526 (MoritzMuehlenhoff)
[12:12:02] PROBLEM - salt-minion processes on tin is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[12:15:58] operations, Parsing-Team, Parsoid, Services: Update ruthenium to Ubuntu 14.04 from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1936535 (MoritzMuehlenhoff)
[12:16:01] operations, Dumps-Generation: Migrate dataset1001 and ms1001 to jessie - https://phabricator.wikimedia.org/T123724#1936534 (MoritzMuehlenhoff)
[12:16:14] operations, Parsing-Team, Parsoid, Services: Update ruthenium to Ubuntu 14.04 from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1901606 (MoritzMuehlenhoff)
[12:16:29] operations, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1936538 (MoritzMuehlenhoff)
[12:16:30] operations, Parsing-Team, Parsoid, Services: Update ruthenium to Ubuntu 14.04 from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1901606 (MoritzMuehlenhoff)
[12:17:18] operations, Analytics: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#1936541 (MoritzMuehlenhoff)
[12:17:20] operations, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1931694 (MoritzMuehlenhoff)
[12:17:58] !log restarting Jenkins for plugins updates
[12:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:18:40] ms-fe3001 is reinstalling so icinga will weep shortly
[12:19:53] operations,
10Dumps-Generation: Migrate dataset1001 and ms1001 to jessie - https://phabricator.wikimedia.org/T123724#1936547 (10ArielGlenn) p:5Triage>3Normal a:3ArielGlenn [12:21:09] 6operations: Migrate hydrogen/chromium to jessie - https://phabricator.wikimedia.org/T123727#1936549 (10MoritzMuehlenhoff) 3NEW [12:21:29] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1936556 (10MoritzMuehlenhoff) [12:21:30] 6operations: Migrate hydrogen/chromium to jessie - https://phabricator.wikimedia.org/T123727#1936557 (10MoritzMuehlenhoff) [12:24:31] PROBLEM - salt-minion processes on mira is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:35:28] 6operations, 10OTRS: upgrade iodine to jessie or find a new host with jessie for OTRS - https://phabricator.wikimedia.org/T105125#1936562 (10MoritzMuehlenhoff) [12:35:30] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1936561 (10MoritzMuehlenhoff) [12:37:51] RECOVERY - salt-minion processes on tin is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:38:57] 6operations: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#1936565 (10MoritzMuehlenhoff) 3NEW [12:39:35] 6operations: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#1936573 (10MoritzMuehlenhoff) [12:39:36] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1936572 (10MoritzMuehlenhoff) [12:41:51] 6operations: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#1936574 (10MoritzMuehlenhoff) 3NEW [12:44:15] 6operations: Migrate puppetmaster/backends to jessie - https://phabricator.wikimedia.org/T123730#1936586 (10MoritzMuehlenhoff) 3NEW [12:44:33] 6operations: Migrate puppetmaster/backends to jessie - 
https://phabricator.wikimedia.org/T123730#1936594 (10MoritzMuehlenhoff) [12:44:34] 6operations: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#1936595 (10MoritzMuehlenhoff) [12:48:26] 6operations: Migrate labsdb1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#1936600 (10MoritzMuehlenhoff) 3NEW [12:50:12] RECOVERY - salt-minion processes on mira is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:51:33] 6operations: Migrate nitrogen to jessie - https://phabricator.wikimedia.org/T123732#1936607 (10MoritzMuehlenhoff) 3NEW [12:51:55] 6operations: Migrate nitrogen to jessie - https://phabricator.wikimedia.org/T123732#1936615 (10MoritzMuehlenhoff) [12:51:57] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1936614 (10MoritzMuehlenhoff) [12:51:57] 6operations: Migrate labsdb1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#1936616 (10MoritzMuehlenhoff) [12:56:15] 6operations: Migrate carbon to jessie - https://phabricator.wikimedia.org/T123733#1936623 (10MoritzMuehlenhoff) 3NEW [12:56:34] 6operations: Migrate carbon to jessie - https://phabricator.wikimedia.org/T123733#1936632 (10MoritzMuehlenhoff) [12:56:36] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1936631 (10MoritzMuehlenhoff) [13:00:52] 6operations: Migrate pool counters to trusty/jessie - https://phabricator.wikimedia.org/T123734#1936635 (10MoritzMuehlenhoff) 3NEW [13:02:23] 6operations: Migrate pool counters to trusty/jessie - https://phabricator.wikimedia.org/T123734#1936643 (10MoritzMuehlenhoff) Also, consider T123723 for helium [13:03:04] 6operations: Migrate pool counters to trusty/jessie - https://phabricator.wikimedia.org/T123734#1936646 (10MoritzMuehlenhoff) [13:03:05] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - 
https://phabricator.wikimedia.org/T123525#1936645 (10MoritzMuehlenhoff) [13:10:32] PROBLEM - Static Bugzilla HTTP on bromine is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:14:51] RECOVERY - Static Bugzilla HTTP on bromine is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 7.594 second response time [13:18:59] can somone merge https://gerrit.wikimedia.org/r/#/c/263842/ ? it is not deployment time, but at 00:00 i am not here O_O [13:21:40] Is there a phab ticket for it? [13:24:53] for addinga import source? no. [13:26:08] (03CR) 10Chad: [C: 032] adding w:hewiki to wgImportSources. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263842 (owner: 10Steinsplitter) [13:26:16] Let's do it :) [13:26:37] cool, thanks :-) [13:27:00] (03Merged) 10jenkins-bot: adding w:hewiki to wgImportSources. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263842 (owner: 10Steinsplitter) [13:27:52] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 87.50% of data above the critical threshold [24.0] [13:28:32] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: w:he as import source for commonswiki (duration: 00m 49s) [13:28:36] Steinsplitter: All done ^ [13:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:28:46] thx [13:28:56] yw [13:29:21] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [13:29:41] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[13:30:27] (03PS3) 10Cmjohnson: Adding piwik-roots group bug: task# T122325 [puppet] - 10https://gerrit.wikimedia.org/r/264229 [13:33:49] (03PS1) 10Filippo Giunchedi: admin: report nonexistant users and duplicate gid [puppet] - 10https://gerrit.wikimedia.org/r/264281 [13:34:32] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [13:35:24] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] admin: report nonexistant users and duplicate gid [puppet] - 10https://gerrit.wikimedia.org/r/264281 (owner: 10Filippo Giunchedi) [13:43:02] PROBLEM - puppet last run on mw1095 is CRITICAL: CRITICAL: Puppet has 1 failures [13:44:31] PROBLEM - puppet last run on mw1237 is CRITICAL: CRITICAL: Puppet has 1 failures [13:44:51] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: puppet fail [13:46:42] (03PS3) 10Muehlenhoff: Prevent access to hidden directories [puppet] - 10https://gerrit.wikimedia.org/r/217794 (https://phabricator.wikimedia.org/T94570) [13:47:02] PROBLEM - Static Bugzilla HTTP on bromine is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:21] RECOVERY - Static Bugzilla HTTP on bromine is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 9.824 second response time [13:53:12] moritzm: Hehe, wasn't sure if anyone was still interested which is why I bumped it :) [13:59:29] ostriches: i was lurking in the darker corners of my inbox :-) [14:02:12] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures [14:03:52] !log set sync_speed_min to 5000 for md126 on labstore1001 [14:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:21] PROBLEM - Static Bugzilla HTTP on bromine is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:07:16] (03PS4) 10Cmjohnson: Adding piwik-roots group bug: task# T122325 [puppet] - 10https://gerrit.wikimedia.org/r/264229 [14:08:02] RECOVERY - puppet last run on mw1237 is OK: OK: Puppet is currently enabled, 
last run 14 seconds ago with 0 failures [14:08:32] RECOVERY - Static Bugzilla HTTP on bromine is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 9.351 second response time [14:08:42] RECOVERY - puppet last run on mw1095 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [14:09:49] !log phab restart phd (reports as not running in phab itself) seems ok now [14:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:10:32] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [14:12:02] PROBLEM - salt-minion processes on tin is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:12:13] chasemp: I saw that too (like 30-45m ago?) but they kicked themselves back into shape [14:12:43] I think there is an illusive bug where the daemons stick on a task and phab thinks they are missing for a bit [14:13:20] I've seen it a few times that were suspect but since it's all stateful on restart should be ok either way [14:13:25] !log Temporarily paused md126 RAID check on labstore1001 (sync_action idle) [14:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:15:11] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:17:02] PROBLEM - Static Bugzilla HTTP on bromine is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:19:22] RECOVERY - DPKG on labmon1001 is OK: All packages OK [14:20:41] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [14:20:41] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [14:21:45] (03CR) 10Cmjohnson: [C: 032] Adding piwik-roots group bug: task# T122325 [puppet] - 10https://gerrit.wikimedia.org/r/264229 (owner: 10Cmjohnson) [14:23:21] RECOVERY - 
Static Bugzilla HTTP on bromine is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 0.924 second response time [14:24:31] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [14:24:41] PROBLEM - salt-minion processes on mira is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:26:33] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:27:02] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:27:02] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:27:42] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:29:52] PROBLEM - Static Bugzilla HTTP on bromine is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:26] 6operations: Replace labstore1002 RAID card - https://phabricator.wikimedia.org/T123740#1936716 (10chasemp) 3NEW a:3Cmjohnson [14:32:40] 6operations, 10ops-eqiad: Replace labstore1002 RAID card - https://phabricator.wikimedia.org/T123740#1936716 (10chasemp) [14:37:01] 6operations, 7HHVM: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#1936727 (10MoritzMuehlenhoff) Since we're using precise's version of icu, working on this is part of ops's quarterly goal to get rid of precise ;-) I don't really understand how it's linked to icu 4... 
[14:38:23] RECOVERY - salt-minion processes on tin is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:41:12] (03PS1) 10Cmjohnson: Adding admin role to piwik.yaml bug: task# T122325 [puppet] - 10https://gerrit.wikimedia.org/r/264287 [14:42:55] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [14:43:34] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [14:43:45] cmjohnson1: ^ [14:44:12] mortizm...yep, thanks! [14:45:43] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [14:47:14] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [14:50:44] RECOVERY - salt-minion processes on mira is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:51:11] (03CR) 10Filippo Giunchedi: [C: 04-1] Adding admin role to piwik.yaml bug: task# T122325 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/264287 (owner: 10Cmjohnson) [14:52:01] 6operations: Reimage hooft with jessie and rename to bast3001 - https://phabricator.wikimedia.org/T123712#1936742 (10Southparkfan) [14:54:19] !log reimage ms-fe3002 with trusty [14:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:56:23] RECOVERY - Static Bugzilla HTTP on bromine is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 9.474 second response time [15:01:14] (03PS2) 10Cmjohnson: Adding admin role to piwik.yaml [puppet] - 10https://gerrit.wikimedia.org/r/264287 [15:03:16] PROBLEM - Static Bugzilla HTTP on bromine is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:15] RECOVERY - Static Bugzilla HTTP on bromine is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 9.669 second response time [15:11:55] PROBLEM - salt-minion 
processes on tin is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:18:35] (03PS3) 10Filippo Giunchedi: Adding admin role to piwik.yaml [puppet] - 10https://gerrit.wikimedia.org/r/264287 (https://phabricator.wikimedia.org/T122325) (owner: 10Cmjohnson) [15:18:46] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Adding admin role to piwik.yaml [puppet] - 10https://gerrit.wikimedia.org/r/264287 (https://phabricator.wikimedia.org/T122325) (owner: 10Cmjohnson) [15:24:25] PROBLEM - salt-minion processes on mira is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:25:17] 6operations, 6Analytics-Kanban, 10Wikipedia-iOS-App-Product-Backlog, 6Zero, and 2 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1936788 (10Cmjohnson) [15:25:20] 10Ops-Access-Requests, 6operations, 10Analytics, 5Patch-For-Review: add mforns, milimetric, nuria,ottomata, madhuvishy and joal to piwik-roots - https://phabricator.wikimedia.org/T122325#1936785 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson The admin role has been added to and group added to piwki.yaml [15:31:42] (03PS2) 10Filippo Giunchedi: icinga: report atlas measurement url in alert text [puppet] - 10https://gerrit.wikimedia.org/r/263840 [15:32:15] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] icinga: report atlas measurement url in alert text [puppet] - 10https://gerrit.wikimedia.org/r/263840 (owner: 10Filippo Giunchedi) [15:34:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] "https://puppet-compiler.wmflabs.org/1616/chromium.wikimedia.org/ says it's mostly what you 'd expect, but I 'll test it in labs as well fi" [puppet] - 10https://gerrit.wikimedia.org/r/264059 (owner: 10Alexandros Kosiaris) [15:37:45] 6operations, 7Tracking: Reinstall nitrogen with jessie - https://phabricator.wikimedia.org/T123715#1936844 (10fgiunchedi) [15:38:05] RECOVERY - salt-minion processes on tin is OK: PROCS OK: 1 process with regex 
args ^/usr/bin/python /usr/bin/salt-minion [15:41:01] !log reimage ms-be3001 with trusty [15:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:41:25] PROBLEM - Static Bugzilla HTTP on bromine is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:43:35] RECOVERY - Static Bugzilla HTTP on bromine is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 9.416 second response time [15:49:46] RECOVERY - salt-minion processes on mira is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:59:10] 6operations: Update libzmq3/pyzmq - https://phabricator.wikimedia.org/T106093#1936903 (10MoritzMuehlenhoff) 5Open>3declined Closing this bug. I started to work on this a few months ago, but stopped at some point since it would have lots of effects on reverse dependencies. Also, in the mean time Ariel fixed s... [16:00:04] 6operations: Meta task for various security updates - https://phabricator.wikimedia.org/T96545#1936910 (10MoritzMuehlenhoff) 5Open>3Resolved Old meta task, no longer used by now. [16:00:54] PROBLEM - Static Bugzilla HTTP on bromine is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:01:09] !log bounce hhvm on mw1129 / mw1204 [16:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:02:08] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.676 second response time [16:02:40] RECOVERY - HHVM rendering on mw1129 is OK: HTTP OK: HTTP/1.1 200 OK - 69773 bytes in 0.969 second response time [16:03:08] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 69773 bytes in 0.211 second response time [16:03:09] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.036 second response time [16:05:51] (03CR) 10Alexandros Kosiaris: [C: 032] "tested in deployment-prep. 
works fine both in trusty and jessie, merging" [puppet] - 10https://gerrit.wikimedia.org/r/264059 (owner: 10Alexandros Kosiaris) [16:07:19] RECOVERY - Static Bugzilla HTTP on bromine is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 9.938 second response time [16:07:33] (03PS2) 10Alexandros Kosiaris: Make role::url_downloader unparameterized [puppet] - 10https://gerrit.wikimedia.org/r/264059 [16:07:35] (03PS3) 10Alexandros Kosiaris: Revert "Revert "Add the LVS blocks to url_downloader"" [puppet] - 10https://gerrit.wikimedia.org/r/207490 [16:07:52] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Make role::url_downloader unparameterized [puppet] - 10https://gerrit.wikimedia.org/r/264059 (owner: 10Alexandros Kosiaris) [16:10:30] (03CR) 10Alexandros Kosiaris: "https://gerrit.wikimedia.org/r/#/c/264059/ has been merged and seems to work fine. Let's just apply the role url_downloader on alsafi and " [puppet] - 10https://gerrit.wikimedia.org/r/264205 (https://phabricator.wikimedia.org/T122134) (owner: 10Dzahn) [16:13:48] PROBLEM - salt-minion processes on tin is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:17:59] PROBLEM - Static Bugzilla HTTP on bromine is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:20:09] RECOVERY - Static Bugzilla HTTP on bromine is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 8.812 second response time [16:20:16] (03CR) 10Muehlenhoff: "Alex confirmed the key in PGP-signed mail:" [puppet] - 10https://gerrit.wikimedia.org/r/264125 (owner: 10Alex Monk) [16:23:58] PROBLEM - salt-minion processes on mira is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:24:40] (03PS2) 10Alex Monk: Add my yubikey [puppet] - 10https://gerrit.wikimedia.org/r/264125 [16:26:14] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add my yubikey [puppet] - 10https://gerrit.wikimedia.org/r/264125 (owner: 10Alex Monk) [16:35:41] moritzm, did you run puppet on 
a bastion? [16:36:49] RECOVERY - salt-minion processes on tin is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:36:49] PROBLEM - swift-account-auditor on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [16:36:59] that's me ^ silence expired [16:37:00] PROBLEM - swift-account-reaper on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [16:37:19] PROBLEM - Static Bugzilla HTTP on bromine is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:37:20] PROBLEM - swift-account-replicator on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [16:39:29] RECOVERY - Static Bugzilla HTTP on bromine is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 8.149 second response time [16:40:54] looks like I can go to all bastions except bast1001 [16:49:39] RECOVERY - salt-minion processes on mira is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:50:10] PROBLEM - Static Bugzilla HTTP on bromine is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:52:15] now it works [16:54:30] RECOVERY - Static Bugzilla HTTP on bromine is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 9.082 second response time [16:55:08] poor bromine :( load average: 151.03, 150.73, 150.41 [16:55:09] RECOVERY - Host ms-be2007 is UP: PING OK - Packet loss = 0%, RTA = 38.07 ms [16:56:23] godog, I imagine it's apache serving annual.wikimedia.org/15.wikipedia.org? [16:57:29] Krenair: you are indeed correct sir [16:57:31] Krenair: no, the automatic puppet runs will fix that [16:58:09] 6operations, 10ops-codfw: ms-be2007 - System halted!Error: Integrated RAID - https://phabricator.wikimedia.org/T122844#1937003 (10Papaul) a:5Papaul>3fgiunchedi @fgiunchedi new controller is in place. 
[16:59:30] PROBLEM - configured eth on ms-be2007 is CRITICAL: Connection refused by host [16:59:30] PROBLEM - swift-object-replicator on ms-be2007 is CRITICAL: Connection refused by host [16:59:39] PROBLEM - puppet last run on ms-be2007 is CRITICAL: Connection refused by host [16:59:49] PROBLEM - swift-container-server on ms-be2007 is CRITICAL: Connection refused by host [16:59:50] PROBLEM - salt-minion processes on ms-be2007 is CRITICAL: Connection refused by host [17:00:00] PROBLEM - dhclient process on ms-be2007 is CRITICAL: Connection refused by host [17:00:00] PROBLEM - swift-container-updater on ms-be2007 is CRITICAL: Connection refused by host [17:00:10] PROBLEM - swift-account-auditor on ms-be2007 is CRITICAL: Connection refused by host [17:00:19] PROBLEM - Check size of conntrack table on ms-be2007 is CRITICAL: Connection refused by host [17:00:19] PROBLEM - swift-object-server on ms-be2007 is CRITICAL: Connection refused by host [17:00:29] PROBLEM - swift-object-auditor on ms-be2007 is CRITICAL: Connection refused by host [17:00:31] PROBLEM - swift-account-reaper on ms-be2007 is CRITICAL: Connection refused by host [17:00:50] PROBLEM - swift-account-replicator on ms-be2007 is CRITICAL: Connection refused by host [17:00:50] PROBLEM - swift-container-auditor on ms-be2007 is CRITICAL: Connection refused by host [17:00:51] PROBLEM - very high load average likely xfs on ms-be2007 is CRITICAL: Connection refused by host [17:01:00] PROBLEM - swift-object-updater on ms-be2007 is CRITICAL: Connection refused by host [17:01:00] PROBLEM - DPKG on ms-be2007 is CRITICAL: Connection refused by host [17:01:10] PROBLEM - Static Bugzilla HTTP on bromine is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:01:10] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /home 37482 MB (3% inode=98%) [17:01:10] PROBLEM - swift-account-server on ms-be2007 is CRITICAL: Connection refused by host [17:01:20] PROBLEM - Disk space on ms-be2007 is CRITICAL: 
Connection refused by host [17:01:39] PROBLEM - swift-container-replicator on ms-be2007 is CRITICAL: Connection refused by host [17:01:40] PROBLEM - RAID on ms-be2007 is CRITICAL: Connection refused by host [17:07:40] RECOVERY - Static Bugzilla HTTP on bromine is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 8.036 second response time [17:07:59] PROBLEM - Host ms-be2007 is DOWN: PING CRITICAL - Packet loss = 100% [17:10:14] annual/15 should be mostly static content and cacheable right? [17:10:20] RECOVERY - Host ms-be2007 is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms [17:11:04] yeah, perhaps size limit? homepage.js is 770k [17:11:49] ick [17:12:34] our misc-cluster size rules are stream-mode (but still cache) for >=1MB, and no-cache for >=10MB [17:13:42] heh, so far the only difference I've found, all other assets are getting hits in varnish [17:16:21] yeah [17:16:24] 72 RxURL c /assets/img/faces_cards/shirley.jpg [17:16:24] 72 TxHeader c X-Cache: cp1056 miss(0), cp1070 frontend hit(202) [17:16:33] most of the requests are hits like that [17:17:42] requests for /piwik.php on piwik.wikimedia.org, with arguments related to the 15 campaign, are some of the heaviest hits on misc right now though, and not cacheable [17:18:08] I'm not sure what this is or how it's related, but piwik isn't on bromine [17:19:51] hmmm most js is cacheable, e.g. 
[17:19:51] 83 RxURL c /assets/js/masonry.js [17:19:51] 83 TxHeader c X-Cache: cp1056 hit(5), cp1070 frontend hit(97) [17:20:08] but homepage.js is not, and it's heavy on the top URLs list: [17:20:09] 77 RxURL c /assets/js/homepage.js [17:20:09] 77 TxHeader c X-Cache: cp1069 pass(0), cp1070 frontend pass(0) [17:20:28] indeed, re: homepage.js I'm still suspecting the size but the size checks seem correct [17:20:58] and it's slow to load at all, because it's blocking and coalescing too [17:21:08] hmmmm [17:22:20] I've got a curl fetch of it from home that's been hung for nearly 2 minutes now without downloading any content :/ [17:22:39] nothing specific to homepage.js in apache or in the headers sent back from bromine afaics [17:23:58] 504 Gateway Time-out [17:24:03] on my fetch [17:25:39] yeah I got that sometime too from a browser [17:26:50] PROBLEM - Static Bugzilla HTTP on bromine is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:28:00] hmm? [17:28:31] (ignore me, that looks like an old alarm, as long as it's known ;) ) [17:28:47] that's been flapping all morning [17:29:09] it's slow to load even locally with curl to itself on bromine [17:29:27] but you're right, no real diff in headers [17:29:35] ostriches: kk [17:31:53] 6operations, 7Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#1937078 (10Dzahn) [17:31:54] 6operations, 7Mail: remove/update office@wikipedia.org alias - https://phabricator.wikimedia.org/T123669#1937076 (10Dzahn) 5Open>3Resolved Thanks @faidon. done ``` -office: dphelps ``` [17:35:38] bblack: oh, so the piwik code is in the 15.wp /annual code to create stats and is using the new wmf piwik install at piwik.wikimedia.org [17:35:44] nuria: ^ [17:35:57] hola mutante [17:36:28] ori, ^ [17:37:15] bblack: why are those requests not cacheable? 
[17:37:32] bblack: we should only be requesting the beacon [17:37:53] bblack: ah wait, maybe we need to add caching headers to the apache that serves piwik that we do not have? [17:39:27] don't know [17:39:29] cc mutante [17:39:34] right now I don't think piwik is the problem, just notable in stats [17:40:06] bblack: regardless, our js snippet should have cache headers cc milimetric [17:40:33] yeah, we got close to 1 million requests in the last day for the wikipedia 15 site [17:40:36] goodness gracious [17:41:58] bblack, milimetric ; as far as i can see piwik.js is being served with a last-modified [17:42:14] eh, and i just merged that change to enable caching for annual.wm yesterday [17:42:16] not piwik.js, piwik.php [17:42:21] (reading up, give me a sec to catch up) [17:42:39] mutante: a lot of the 15.wp.o hits are cached, it's just not caching homepage.js for some reason [17:42:41] bblack: but who requests piwik.php? [17:43:16] bblack: beacon is piwik.js and that seems to be cached correctly [17:43:35] bblack: let me see if i can figure out in piwik docs where those requests come from [17:44:10] RECOVERY - Static Bugzilla HTTP on bromine is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 9.946 second response time [17:44:33] bblack: that is a server side tracker so might be requested by non js clients ahhhh [17:44:36] bblack: bots! [17:44:44] bblack: it's the crawlers who are executing that [17:44:51] bblack: * i think* [17:45:14] cc milimetric , makes sense , right? [17:45:51] bblack, milimetric , i think this is it: [17:46:07] bblack: we just need to do away with no-js tracking, cc milimetric [17:46:28] ?pk_campaign=kiwi&pk_keyword=anon [17:47:00] so much of that pk_campaign stuff [17:47:02] agreed, if piwik.php is causing load, we can remove the