[00:28:51] operations, MediaWiki-extensions-CentralAuth, Database: Special:GlobalUsers varies between claiming a user is or isn't attached - https://phabricator.wikimedia.org/T102915#1538129 (Krenair)
[00:43:53] operations: Grant Access to OIT to store time series data in Graphite and Access Graphana Dashboards - https://phabricator.wikimedia.org/T109028#1538161 (JKrauska) NEW
[00:45:03] operations, Traffic, HTTPS: acquire SSL certificate for w.wiki - https://phabricator.wikimedia.org/T91612#1538178 (Dzahn)
[00:45:12] operations, Traffic, HTTPS: acquire SSL certificate for w.wiki - https://phabricator.wikimedia.org/T91612#1091083 (Dzahn)
[00:45:14] operations, Traffic, Wikimedia-General-or-Unknown, HTTPS: Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#1538179 (Dzahn)
[00:51:04] operations, HTTPS: Chrome on OS X 10.11 ("El Capitan") does not trust Wikimedia certificates - https://phabricator.wikimedia.org/T109029#1538194 (ori) NEW
[00:51:07] operations, Graphite: Grant Access to OIT to store time series data in Graphite and Access Graphana Dashboards - https://phabricator.wikimedia.org/T109028#1538201 (Krenair)
[00:51:21] Ops-Access-Requests, operations, Graphite: Grant Access to OIT to store time series data in Graphite and Access Graphana Dashboards - https://phabricator.wikimedia.org/T109028#1538161 (Krenair)
[00:52:11] operations, HTTPS: Chrome on OS X 10.11 ("El Capitan") does not trust Wikimedia certificates - https://phabricator.wikimedia.org/T109029#1538205 (ori)
[00:52:12] operations, Traffic, HTTPS: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1538204 (ori)
[00:55:20] (PS2) Dzahn: add table for miraheze and bump version [debs/wikistats] - https://gerrit.wikimedia.org/r/231459 (https://phabricator.wikimedia.org/T107398)
[00:57:05] (CR) Dzahn: [C: +2] add table for miraheze and bump version [debs/wikistats] - https://gerrit.wikimedia.org/r/231459 (https://phabricator.wikimedia.org/T107398) (owner: Dzahn)
[01:09:00] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[01:09:42] ?
[01:10:51] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 497 bytes in 3.010 second response time
[01:41:59] ori, is mw1041 still de-pooled (sorry, I wanted to start earlier today, but we had to fix and QA a bug)?
[01:42:52] matt_flaschen: yeah, it's yours for as long as you need it (within reason)
[01:43:22] ori, okay, thanks.
[01:46:29] !log ori@tin Synchronized php-1.26wmf18/includes/OutputPage.php: I5e6c79c70: Optimize the order of styles and scripts in (duration: 00m 12s)
[01:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:49:45] !log Resumed LQT->Flow conversion of mw:Project:Support_desk on mw1041
[01:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:55:46] operations, Wikimedia-Site-Requests: Run "refreshLinks.php --dfn-only" on all wikis periodically - https://phabricator.wikimedia.org/T18112#1538282 (Krenair)
[02:04:31] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-tools/snapshot is not accessible: Permission denied
[02:31:57] !log l10nupdate@tin Synchronized php-1.26wmf18/cache/l10n: l10nupdate for 1.26wmf18 (duration: 06m 35s)
[02:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:35:19] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf18) at 2015-08-14 02:35:19+00:00
[02:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:05:36] operations, HTTPS: Chrome on OS X 10.11 ("El Capitan") does not trust Wikimedia certificates - https://phabricator.wikimedia.org/T109029#1538306 (BBlack) I switched my Mac to El Capitan as well a week or two ago and ran into the same thing. It's a Chrome+ElCapitan bug that Chrome will eventually fix, not...
[03:07:02] operations, Traffic, HTTPS: acquire SSL certificate for w.wiki - https://phabricator.wikimedia.org/T91612#1538307 (BBlack) Note things have changed since we last looked at this ticket. We're not using SNI certs anymore.
[04:16:31] (CR) Dzahn: [C: +1] "3306 and 3307 here on db1016, lgtm" [puppet] - https://gerrit.wikimedia.org/r/228783 (https://phabricator.wikimedia.org/T104699) (owner: Muehlenhoff)
[04:36:35] Blocked-on-Operations, MediaWiki-extensions-CentralAuth, Wikimedia-General-or-Unknown, MW-1.26-release, and 2 others: Increase "remember me" login cookie expiry from 30 days to 1 year on Wikimedia wikis - https://phabricator.wikimedia.org/T68699#1538358 (Mattflaschen) >>! In T68699#1534649, @BBlack...
[04:39:47] (PS1) Ori.livneh: base: ensure => absent on 'command-not-found' [puppet] - https://gerrit.wikimedia.org/r/231487
[04:39:58] operations, Traffic, HTTPS: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1538362 (ori)
[04:39:59] operations, HTTPS: Chrome on OS X 10.11 ("El Capitan") does not trust Wikimedia certificates - https://phabricator.wikimedia.org/T109029#1538359 (ori) Open>declined a: ori Upstream bug.
[04:40:28] (PS2) Ori.livneh: base: ensure => absent on 'command-not-found' [puppet] - https://gerrit.wikimedia.org/r/231487
[04:42:28] (CR) Ori.livneh: [C: +2] base: ensure => absent on 'command-not-found' [puppet] - https://gerrit.wikimedia.org/r/231487 (owner: Ori.livneh)
[04:57:21] (CR) BearND: [C: +1] MobileApps: Do not use the proxy to issue requests [puppet] - https://gerrit.wikimedia.org/r/231463 (owner: Mobrovac)
[05:56:45] (PS3) KartikMistry: Updated package to 0.1+svn~61425 [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/231280
[06:05:10] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: 11.11% of data above the critical threshold [100000000.0]
[06:14:41] RECOVERY - Outgoing network saturation on labstore1003 is OK: Less than 10.00% above the threshold [75000000.0]
[06:30:52] PROBLEM - puppet last run on mw2095 is CRITICAL: puppet fail
[06:31:22] PROBLEM - puppet last run on cp1054 is CRITICAL: Puppet has 1 failures
[06:32:01] PROBLEM - puppet last run on db1015 is CRITICAL: Puppet has 1 failures
[06:32:01] PROBLEM - puppet last run on mw1061 is CRITICAL: Puppet has 1 failures
[06:32:11] PROBLEM - puppet last run on cp2002 is CRITICAL: Puppet has 1 failures
[06:32:21] PROBLEM - puppet last run on db2058 is CRITICAL: Puppet has 1 failures
[06:32:41] PROBLEM - puppet last run on db1045 is CRITICAL: Puppet has 2 failures
[06:33:41] PROBLEM - puppet last run on mw2018 is CRITICAL: Puppet has 1 failures
[06:34:22] !log reset email/password for User:Auréola after multi factor user confirmation.
[06:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:55:11] RECOVERY - puppet last run on cp2002 is OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:55:41] RECOVERY - puppet last run on db1045 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:21] RECOVERY - puppet last run on cp1054 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:52] RECOVERY - puppet last run on db1015 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:00] RECOVERY - puppet last run on mw1061 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:21] RECOVERY - puppet last run on db2058 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:51] RECOVERY - puppet last run on mw2095 is OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:58:41] RECOVERY - puppet last run on mw2018 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:10:37] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Aug 14 07:10:37 UTC 2015 (duration 10m 36s)
[07:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:17:45] (CR) Alexandros Kosiaris: Added tilerator service, granted kartotherian OSM DB read access (1 comment) [puppet] - https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) (owner: Yurik)
[07:21:01] (PS3) Alexandros Kosiaris: service::node: Make the number of workers to start configurable [puppet] - https://gerrit.wikimedia.org/r/231319 (https://phabricator.wikimedia.org/T108888) (owner: Mobrovac)
[07:21:07] (CR) Alexandros Kosiaris: [C: +2 V: +2] service::node: Make the number of workers to start configurable [puppet] - https://gerrit.wikimedia.org/r/231319 (https://phabricator.wikimedia.org/T108888) (owner: Mobrovac)
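(The 'command-not-found' removal merged above, Gerrit 231487, boils down to a one-resource Puppet pattern. A minimal sketch assuming the stock package type; only the package name is taken from the patch summary:)
```
# Remove the package everywhere the base class is applied; Puppet
# uninstalls it if present and otherwise does nothing.
package { 'command-not-found':
    ensure => absent,
}
```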
[07:22:54] (PS2) Alexandros Kosiaris: MobileApps: Do not use the proxy to issue requests [puppet] - https://gerrit.wikimedia.org/r/231463 (owner: Mobrovac)
[07:23:00] (CR) Alexandros Kosiaris: [C: +2 V: +2] MobileApps: Do not use the proxy to issue requests [puppet] - https://gerrit.wikimedia.org/r/231463 (owner: Mobrovac)
[07:26:21] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[07:27:43] <_joe_> \o/
[07:27:50] <_joe_> mobrovac, akosiaris well done
[07:29:02] <_joe_> mobrovac: do we have any service that doesn't have the auto health checks?
[07:30:19] yay!
[07:30:28] (PS3) Alexandros Kosiaris: Tilerator: start ncpu / 2 workers [puppet] - https://gerrit.wikimedia.org/r/231427 (https://phabricator.wikimedia.org/T108974) (owner: Mobrovac)
[07:30:33] _joe_: well, zotero and apertium
[07:30:40] but those are not nodejs services
[07:31:00] tilerator and kartotherian are nodejs services though
[07:31:11] but for now we do not want too much monitoring on them
[07:31:21] (CR) Alexandros Kosiaris: [C: +2 V: +2] Tilerator: start ncpu / 2 workers [puppet] - https://gerrit.wikimedia.org/r/231427 (https://phabricator.wikimedia.org/T108974) (owner: Mobrovac)
[07:32:32] operations, Discovery, Maps, Services, and 2 others: Puppetize Tilerator for deployment - https://phabricator.wikimedia.org/T105074#1538517 (akosiaris)
[07:32:35] operations, Services, Discovery-Maps-Sprint, Patch-For-Review: Configure maps cluster's tilerator to the specific number of workers - https://phabricator.wikimedia.org/T108974#1538515 (akosiaris) Open>Resolved a: akosiaris
[07:36:42] <_joe_> akosiaris: I was thinking more of making a simple script to do a rolling release with health checks, as it seems this would help services a lot, while we wait for a proper deployment system
[07:36:44] operations, MediaWiki-extensions-CentralAuth, Database: Special:GlobalUsers varies between claiming a user is or isn't attached - https://phabricator.wikimedia.org/T102915#1538529 (jcrespo) a: jcrespo
[07:37:03] operations, MediaWiki-extensions-CentralAuth, Database: Special:GlobalUsers varies between claiming a user is or isn't attached - https://phabricator.wikimedia.org/T102915#1377310 (jcrespo) p: Triage>High
[07:37:57] (CR) Alexandros Kosiaris: "I think this should become the default. Unless there is a reason for services to use a proxy, in which case we will just configure it, the" [puppet] - https://gerrit.wikimedia.org/r/231463 (owner: Mobrovac)
[07:38:17] (PS2) Alexandros Kosiaris: maps: Add ALTERs to ensure passwords [puppet] - https://gerrit.wikimedia.org/r/231301
[07:38:51] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[07:39:18] (CR) Alexandros Kosiaris: [C: +2 V: +2] maps: Add ALTERs to ensure passwords [puppet] - https://gerrit.wikimedia.org/r/231301 (owner: Alexandros Kosiaris)
[07:42:26] (Abandoned) Alexandros Kosiaris: mobileapps service: Varnish / parsoidcache configuration [puppet] - https://gerrit.wikimedia.org/r/207052 (https://phabricator.wikimedia.org/T92627) (owner: Mobrovac)
[07:42:50] (Abandoned) Alexandros Kosiaris: mobileapps service: LVS configuration [puppet] - https://gerrit.wikimedia.org/r/207051 (https://phabricator.wikimedia.org/T92627) (owner: Mobrovac)
[07:42:57] (Abandoned) Alexandros Kosiaris: mobileapps service: Role and module for SCA [puppet] - https://gerrit.wikimedia.org/r/207050 (https://phabricator.wikimedia.org/T92627) (owner: Mobrovac)
[07:43:46] (Abandoned) Alexandros Kosiaris: Gerrit also listens on port 22 [puppet] - https://gerrit.wikimedia.org/r/172313 (https://bugzilla.wikimedia.org/35611) (owner: Dereckson)
[07:47:38] (PS4) Alexandros Kosiaris: Updated package to 0.1+svn~61425 [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/231280 (https://phabricator.wikimedia.org/T107270) (owner: KartikMistry)
[07:53:11] operations, ContentTranslation-cxserver, Language-Engineering, MediaWiki-extensions-ContentTranslation, and 3 others: Apertium leaves a ton of stale processes, consumes all the available memory - https://phabricator.wikimedia.org/T107270#1538564 (akosiaris) That sounds awesome. Thanks! @KartikMistr...
[07:56:59] (Abandoned) Alexandros Kosiaris: OSM: rename osm and similar classes to osm::db [puppet] - https://gerrit.wikimedia.org/r/204161 (owner: MaxSem)
[07:58:00] (CR) Alexandros Kosiaris: [C: +2] Updated package to 0.1+svn~61425 [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/231280 (https://phabricator.wikimedia.org/T107270) (owner: KartikMistry)
[08:09:53] akosiaris: thanks!
[08:11:13] !log uploaded to apt.wikimedia.org jessie-wikimedia: apertium-apy_0.1+svn~61425-1
[08:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:11:25] kart_: don't thamk me yet. let's first see if it works
[08:11:32] s/m/n/
[08:11:43] thank ne?
[08:12:29] Reedy: never put a /g there :P
[08:12:31] akosiaris: this was for +2 and upload ;)
[08:12:36] operations, Gitblit-Deprecate: evaluate "klaus" to replace gitblit as a git web viewer - https://phabricator.wikimedia.org/T109004#1538600 (Matanya) I am not sure we need this. Phabricator can do it as good with diffusion. which @demon is working on migrating to.
[08:13:16] Reedy: https://xkcd.com/208/
[08:21:53] kart_: [I 150814 08:21:00 servlet:700] 71 pair modes found
[08:21:59] so it's 58 or 71 ?
[08:22:14] oh, apertium support 71, but we only advertise 58 ?
[08:22:42] "we" in advertise being cxserver
[08:23:14] operations, Database: mariadb multi-source replication glitch with site_identifiers - https://phabricator.wikimedia.org/T106647#1538621 (jcrespo) a: jcrespo Reloaded dbstore2001 s7 from db2040.
[08:25:31] akosiaris: 58.
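(_joe_'s 07:36 idea of "a simple script to do a rolling release with health checks" could look roughly like the sketch below. This is illustrative only: the host list, restart command, port and the service-runner /_info endpoint are assumptions, not an existing tool:)
```
#!/bin/bash
# Touch one host at a time; abort if anything fails, and only move on
# once the health endpoint answers again.
set -e
for host in scb1001 scb1002; do                     # placeholder host list
    ssh "$host" 'sudo service mobileapps restart'   # placeholder deploy step
    until curl -sf "http://${host}:8888/_info" >/dev/null; do
        sleep 2                                     # wait for the service to come back
    done
done
```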
[08:26:32] !log upgraded and restarted apertium on sca100{1,2}
[08:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:29:55] operations, ContentTranslation-cxserver, Language-Engineering, MediaWiki-extensions-ContentTranslation, and 3 others: Apertium leaves a ton of stale processes, consumes all the available memory - https://phabricator.wikimedia.org/T107270#1538639 (akosiaris) Change merged, package built, uploaded on...
[08:30:12] kart_: ^
[08:30:14] now we wait
[08:32:12] kart_: although I suppose we could craft some very simple curl requests to see what's happening. also that tools/sanity-test-apy.py seems like a request generator
[08:34:48] (PS2) Alexandros Kosiaris: Introduce mobileapps.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/230780 (https://phabricator.wikimedia.org/T105538)
[08:34:49] akosiaris: yes. That's easiest way I guess.
[08:37:41] (CR) Alexandros Kosiaris: [C: +2] Introduce mobileapps.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/230780 (https://phabricator.wikimedia.org/T105538) (owner: Alexandros Kosiaris)
[08:41:53] operations, Discovery-Maps-Sprint: git deploy shows 5 tilerator instances instead of 4 - https://phabricator.wikimedia.org/T108956#1538691 (akosiaris) I honestly have no idea how tin made it into the DB (a bug or something?). But let's remove it anyway.
[08:45:56] ori, it's done. You can repool the server. Ping me back so I know you got this.
[08:46:00] Thanks again. :)
[08:52:03] <_joe_> matt_flaschen: I might help with repooling - ori is hopefully asleep :P
[08:52:14] <_joe_> but I need some context :)
[08:56:23] operations, Human-Resources: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1538733 (Qgil) @Alantz, who could be a good contact in HR to help solving this task?
[08:56:35] operations: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1538735 (Qgil)
[08:57:16] operations, MediaWiki-extensions-CentralAuth, Database: Special:GlobalUsers varies between claiming a user is or isn't attached - https://phabricator.wikimedia.org/T102915#1538739 (jcrespo) Results per database: ``` db1033: login db1028: NULL db1034: login db1039: NULL db1041: login db1062: NULL db20...
[09:03:24] operations, RESTBase, RESTBase-Cassandra: Cassandra internode TLS encryption - https://phabricator.wikimedia.org/T108953#1538760 (fgiunchedi) looks like we could do the following: 1. for each host in a given cassandra cluster, generate its public/private keypair, add the public key to the trusted stor...
[09:08:54] operations, MediaWiki-extensions-CentralAuth, Database: Special:GlobalUsers varies between claiming a user is or isn't attached - https://phabricator.wikimedia.org/T102915#1538781 (jcrespo) But a closer look at the query shows that this is not a data-related issue: ``` mysql> set sql_mode='ONLY_FULL_G...
[09:13:28] operations, Technical-Debt: Retire Torrus - https://phabricator.wikimedia.org/T87840#1538794 (mark) The main thing we still need it for is //aggregated// graphs, particularly for aggregated power usage in our data centers, which LibreNMS doesn't really do. When we have an alternative for that, I think we c...
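(The "very simple curl requests" akosiaris mentions at 08:32 would be something along these lines, assuming apertium-apy's standard /listPairs and /translate endpoints; localhost:2737 is a placeholder for wherever APy listens on the sca100x hosts:)
```
# List the pair modes APy loaded, then exercise one translation path.
curl -s 'http://localhost:2737/listPairs'
curl -s 'http://localhost:2737/translate?langpair=en|es&q=hello'
```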
[09:13:42] (CR) Giuseppe Lavagetto: [C: +2 V: +2] puppet-compiler: first commit [software/puppet-compiler] - https://gerrit.wikimedia.org/r/228849 (https://phabricator.wikimedia.org/T96802) (owner: Giuseppe Lavagetto)
[09:16:21] (PS3) Alexandros Kosiaris: Setup LVS for mobileapps service on scb cluster [puppet] - https://gerrit.wikimedia.org/r/230790
[09:18:09] (PS1) Giuseppe Lavagetto: puppet_compiler: refresh module/role [puppet] - https://gerrit.wikimedia.org/r/231500 (https://phabricator.wikimedia.org/T96802)
[09:19:59] (PS2) Giuseppe Lavagetto: puppet_compiler: refresh module/role [puppet] - https://gerrit.wikimedia.org/r/231500 (https://phabricator.wikimedia.org/T96802)
[09:24:22] (PS4) Alexandros Kosiaris: Setup LVS for mobileapps service on scb cluster [puppet] - https://gerrit.wikimedia.org/r/230790
[09:26:53] (CR) Giuseppe Lavagetto: [C: +2] puppet_compiler: refresh module/role [puppet] - https://gerrit.wikimedia.org/r/231500 (https://phabricator.wikimedia.org/T96802) (owner: Giuseppe Lavagetto)
[09:27:41] (PS5) Alexandros Kosiaris: Setup LVS for mobileapps service on scb cluster [puppet] - https://gerrit.wikimedia.org/r/230790
[09:28:05] <_joe_> hihi this is payback for yesterday
[09:28:13] <_joe_> where you burned me to a merge
[09:28:18] <_joe_> :P
[09:28:24] ?
[09:28:33] I won the race ?
[09:28:38] actually it is not
[09:28:43] I am indeed fixing some things
[09:28:45] <_joe_> oh ok
[09:29:03] <_joe_> I thought it was the classic "damn he merged before me, lemme rebase again"
[09:29:16] <_joe_> I love our repo to be ff-only
[09:29:21] for example I just figured out that since you did the ordered yaml change, I probably better do ordered_dicts in new_wmf_service.py as well
[09:29:24] <_joe_> but sometimes it's annoying
[09:29:32] I mostly find it annoying
[09:29:38] <_joe_> heh, maybe :)
[09:29:40] but since everyone seems to like it
[09:31:33] (PS6) Alexandros Kosiaris: Setup LVS for mobileapps service on scb cluster [puppet] - https://gerrit.wikimedia.org/r/230790
[09:37:57] (PS7) Alexandros Kosiaris: Setup LVS for mobileapps service on scb cluster [puppet] - https://gerrit.wikimedia.org/r/230790
[09:38:04] (CR) Alexandros Kosiaris: [C: +2 V: +2] Setup LVS for mobileapps service on scb cluster [puppet] - https://gerrit.wikimedia.org/r/230790 (owner: Alexandros Kosiaris)
[09:40:40] PROBLEM - puppet last run on analytics1012 is CRITICAL: puppet fail
[09:51:44] operations, Wikimedia-General-or-Unknown: Missing tab styles in "DB clusters" page (on noc.wikimedia.org) - https://phabricator.wikimedia.org/T109045#1538889 (PleaseStand) NEW
[09:59:37] operations, MediaWiki-extensions-CentralAuth, Database: Special:GlobalUsers varies between claiming a user is or isn't attached - https://phabricator.wikimedia.org/T102915#1538923 (jcrespo) This query returns the same results on all hosts: ``` SELECT MAX(gu_id), gu_name,...
[10:05:00] Were the parsoid debs moved? https://lists.wikimedia.org/pipermail/wikitext-l/2015-August/000946.html
[10:07:26] RECOVERY - puppet last run on analytics1012 is OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[10:08:14] Nemo_bis: I see parsoid.wmflabs.org there, you probably need the people responsible for that project
[10:09:10] Yes that's their list :) I thought maybe someone here remembered some recent move
[10:12:37] operations, Services, Mobile-Content-Service, Patch-For-Review, service-deployment-requests: New Service Request mobileapps - https://phabricator.wikimedia.org/T105538#1446023 (akosiaris) All changes merged, icinga is happy, ops part is done I think in deploying this. I suppose it's now up to rest...
[10:18:27] (PS1) Filippo Giunchedi: cassandra: WIP support for multiple instances [puppet] - https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253)
[10:20:43] (PS1) Giuseppe Lavagetto: git::install: make $lock_file truly optional [puppet] - https://gerrit.wikimedia.org/r/231515
[10:20:45] (PS1) Giuseppe Lavagetto: puppet_compiler: small fixes [puppet] - https://gerrit.wikimedia.org/r/231516
[10:20:54] (CR) jenkins-bot: [V: -1] puppet_compiler: small fixes [puppet] - https://gerrit.wikimedia.org/r/231516 (owner: Giuseppe Lavagetto)
[10:21:28] (PS2) Giuseppe Lavagetto: git::install: make $lock_file truly optional [puppet] - https://gerrit.wikimedia.org/r/231515
[10:31:31] (CR) Giuseppe Lavagetto: [C: +2] "confirmed to be a noop on the phab host." [puppet] - https://gerrit.wikimedia.org/r/231515 (owner: Giuseppe Lavagetto)
[10:33:19] (PS2) Giuseppe Lavagetto: puppet_compiler: small fixes [puppet] - https://gerrit.wikimedia.org/r/231516
[10:36:49] (CR) Giuseppe Lavagetto: [C: +2] puppet_compiler: small fixes [puppet] - https://gerrit.wikimedia.org/r/231516 (owner: Giuseppe Lavagetto)
[11:04:06] (CR) Yurik: Tilerator: start ncpu / 2 workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/231427 (https://phabricator.wikimedia.org/T108974) (owner: Mobrovac)
[11:05:16] (PS1) Giuseppe Lavagetto: puppet_compiler: misc improvements [puppet] - https://gerrit.wikimedia.org/r/231527
[11:05:34] operations, Discovery, Maps, Services, and 2 others: Puppetize Tilerator for deployment - https://phabricator.wikimedia.org/T105074#1539100 (Yurik) Open>Resolved yeppi!
[11:05:56] (CR) Giuseppe Lavagetto: [C: +2] puppet_compiler: misc improvements [puppet] - https://gerrit.wikimedia.org/r/231527 (owner: Giuseppe Lavagetto)
[11:06:03] (CR) Giuseppe Lavagetto: [V: +2] puppet_compiler: misc improvements [puppet] - https://gerrit.wikimedia.org/r/231527 (owner: Giuseppe Lavagetto)
[11:11:02] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: puppet fail
[11:12:09] operations, Discovery-Maps-Sprint, Patch-For-Review: Add user/passwords info for the production configuration file - https://phabricator.wikimedia.org/T108610#1539117 (Yurik) Open>Resolved
[11:17:50] Ops-Access-Requests, operations, Graphite: Grant Access to OIT to store time series data in Graphite and Access Graphana Dashboards - https://phabricator.wikimedia.org/T109028#1539152 (faidon) Open>declined a: faidon I think there are good reasons to keep the production and OIT infrastructures...
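(jcrespo's finding above, 09:08 and 09:59, is a classic indeterminate-GROUP-BY bug rather than data drift: the query mixes an aggregate with non-aggregated columns, so each replica is free to return those columns from a different row of the group. A reduced illustration, reusing the CentralAuth table and column names quoted in the task; not the exact production query:)
```lang=sql
SET sql_mode = 'ONLY_FULL_GROUP_BY';
-- Rejected under this mode, because lu_attached_method is neither grouped
-- nor aggregated; without the mode, MySQL silently picks it from an
-- arbitrary row per group, which is why replicas can disagree.
SELECT MAX(gu_id), gu_name, lu_attached_method
FROM globaluser
LEFT JOIN localuser ON lu_name = gu_name
GROUP BY gu_name;
```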
[11:27:55] operations, Mobile-Apps, Services, Mobile-Content-Service: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1539197 (mobrovac) p: Normal>High
[11:28:25] operations, Mobile-Apps, Services, Mobile-Content-Service: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1116669 (mobrovac) The service itself is live, hooking it up in #RESTBase today.
[11:30:45] operations, Discovery, Traffic, Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1539201 (Lydia_Pintscher) \o/
[11:38:12] RECOVERY - puppet last run on labcontrol2001 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:50:39] (PS1) Jcrespo: Pool db1018 also with the same roles as db1036 [mediawiki-config] - https://gerrit.wikimedia.org/r/231541
[11:56:12] (PS2) Jcrespo: Pool db1018 also with the same roles as db1036 [mediawiki-config] - https://gerrit.wikimedia.org/r/231541
[11:57:02] !log reedy@tin Synchronized php-1.26wmf18/extensions/Translate: Stop calling deprecated Elastica function (duration: 00m 13s)
[11:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:58:34] anyone against trying my last commit to confirm/deny the issue?
[12:08:48] (CR) Jcrespo: [C: +2] Pool db1018 also with the same roles as db1036 [mediawiki-config] - https://gerrit.wikimedia.org/r/231541 (owner: Jcrespo)
[12:09:12] (PS2) Jcrespo: Update comment about database disk size for db1050 and db1049 [mediawiki-config] - https://gerrit.wikimedia.org/r/231303
[12:09:34] (CR) Jcrespo: [C: +2] Update comment about database disk size for db1050 and db1049 [mediawiki-config] - https://gerrit.wikimedia.org/r/231303 (owner: Jcrespo)
[12:11:50] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Load-balance db1036 roles (duration: 00m 11s)
[12:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:12:55] (PS8) Filippo Giunchedi: diamond: add upstart/systemd service stats [puppet] - https://gerrit.wikimedia.org/r/224093 (https://phabricator.wikimedia.org/T108027)
[12:13:02] (CR) Filippo Giunchedi: [C: +2 V: +2] diamond: add upstart/systemd service stats [puppet] - https://gerrit.wikimedia.org/r/224093 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[12:23:58] operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1539350 (fgiunchedi) >>! In T95253#1532040, @fgiunchedi wrote: > I'm trying the systemd instances first, (no puppet code review yet), basically...
[12:30:25] !log Restarting db1042 after data import
[12:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:34:05] operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1539380 (mobrovac) >>! In T95253#1539350, @fgiunchedi wrote: > I'm not yet sure how we should approach the puppet part wrt multiple instances, t...
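(For reference, the "systemd instances" approach fgiunchedi describes in T95253 usually means a template unit instantiated once per Cassandra instance. The unit below is a sketch under that assumption; paths and flags are illustrative, not the contents of the WIP puppet change:)
```
# /etc/systemd/system/cassandra@.service -- hypothetical template unit;
# %i expands to the instance name (e.g. cassandra@a, cassandra@b).
[Unit]
Description=Cassandra instance %i
After=network.target

[Service]
User=cassandra
ExecStart=/usr/sbin/cassandra -f -Dcassandra.config=file:///etc/cassandra-%i/cassandra.yaml
Restart=on-failure

[Install]
WantedBy=multi-user.target
```
(Each instance then gets its own config directory and data paths, and something like `systemctl start cassandra@a cassandra@b` brings up two JVMs on one hardware node.)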
[12:36:33] (PS1) Giuseppe Lavagetto: puppet_compiler: fix puppet invocation, dest directories [puppet] - https://gerrit.wikimedia.org/r/231546
[12:37:15] (CR) jenkins-bot: [V: -1] puppet_compiler: fix puppet invocation, dest directories [puppet] - https://gerrit.wikimedia.org/r/231546 (owner: Giuseppe Lavagetto)
[12:39:58] operations, Discovery, Traffic, Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1539397 (JanZerebecki)
[12:44:18] (PS2) Giuseppe Lavagetto: puppet_compiler: fix puppet invocation, dest directories [puppet] - https://gerrit.wikimedia.org/r/231546
[12:45:01] (CR) jenkins-bot: [V: -1] puppet_compiler: fix puppet invocation, dest directories [puppet] - https://gerrit.wikimedia.org/r/231546 (owner: Giuseppe Lavagetto)
[12:45:03] (CR) Giuseppe Lavagetto: [C: +2 V: +2] puppet_compiler: fix puppet invocation, dest directories [puppet] - https://gerrit.wikimedia.org/r/231546 (owner: Giuseppe Lavagetto)
[12:47:09] (PS3) Giuseppe Lavagetto: puppet_compiler: fix puppet invocation, dest directories [puppet] - https://gerrit.wikimedia.org/r/231546
[12:48:04] (CR) Giuseppe Lavagetto: [C: +2] puppet_compiler: fix puppet invocation, dest directories [puppet] - https://gerrit.wikimedia.org/r/231546 (owner: Giuseppe Lavagetto)
[12:48:35] (CR) Mobrovac: [C: -1] "Nice! A first comment in-lined." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) (owner: Filippo Giunchedi)
[12:56:20] operations, Gitblit-Deprecate: evaluate "klaus" to replace gitblit as a git web viewer - https://phabricator.wikimedia.org/T109004#1539452 (Aklapper) I'd also say to go for Diffusion but I won't stop anyone to compare gitblit, klaus and diffusion if wanted for some reason I don't know or understand. :)
[13:00:31] (PS1) Jcrespo: Repool db1042 as vslow and dump roles [mediawiki-config] - https://gerrit.wikimedia.org/r/231550 (https://phabricator.wikimedia.org/T108316)
[13:01:38] operations, Gitblit-Deprecate: evaluate "klaus" to replace gitblit as a git web viewer - https://phabricator.wikimedia.org/T109004#1539475 (BBlack) On a first glance at their demo, the big missing thing seems to be any kind of Search ability. We really need to be able to (a) search repo names from the ra...
[13:04:52] PROBLEM - Hadoop NodeManager on analytics1016 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[13:06:25] !log reedy@tin Synchronized php-1.26wmf18/extensions/Flow: Fix ElasticaQuery logspam (duration: 00m 13s)
[13:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:06:48] !log reedy@tin Synchronized php-1.26wmf18/extensions/GeoData: Fix ElasticaQuery logspam (duration: 00m 13s)
[13:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:12:58] wow
[13:13:59] !log reedy@tin Synchronized php-1.26wmf18/extensions/CirrusSearch/: Fix ElasticaQuery logspam (duration: 00m 13s)
[13:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:18:31] (PS1) Giuseppe Lavagetto: git::install: fix default mode [puppet] - https://gerrit.wikimedia.org/r/231555
[13:18:56] (CR) Giuseppe Lavagetto: [C: +2] git::install: fix default mode [puppet] - https://gerrit.wikimedia.org/r/231555 (owner: Giuseppe Lavagetto)
[13:20:05] operations, Gitblit-Deprecate: evaluate "klaus" to replace gitblit as a git web viewer - https://phabricator.wikimedia.org/T109004#1539537 (demon) Open>declined a: demon Diffusion it is.
[13:22:43] (PS2) Giuseppe Lavagetto: Make jq available on deployment servers [puppet] - https://gerrit.wikimedia.org/r/223974 (owner: EBernhardson)
[13:26:12] operations, Database: TokuDB crashes frequently -consider upgrade it or seach for alternative engines with similar features - https://phabricator.wikimedia.org/T109069#1539548 (jcrespo) NEW a: jcrespo
[13:26:40] operations, Database: TokuDB crashes frequently -consider upgrade it or search for alternative engines with similar features - https://phabricator.wikimedia.org/T109069#1539556 (jcrespo)
[13:28:21] RECOVERY - Hadoop NodeManager on analytics1016 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[13:35:12] (PS1) BBlack: Attempt to fix CA token issue via double-value detection -> delete on .wikidata.org [puppet] - https://gerrit.wikimedia.org/r/231556 (https://phabricator.wikimedia.org/T109038)
[13:37:43] (PS2) BBlack: Attempt to fix CA token issue via double-value detection -> delete on .wikidata.org [puppet] - https://gerrit.wikimedia.org/r/231556 (https://phabricator.wikimedia.org/T109038)
[13:45:42] (CR) Ottomata: "I'd like it if jq was just part of a base install :)" [puppet] - https://gerrit.wikimedia.org/r/223974 (owner: EBernhardson)
[13:47:29] (CR) Jcrespo: [C: +2] Repool db1042 as vslow and dump roles [mediawiki-config] - https://gerrit.wikimedia.org/r/231550 (https://phabricator.wikimedia.org/T108316) (owner: Jcrespo)
[13:53:11] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1042 (vslow, dump) (duration: 00m 12s)
[13:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:54:24] (PS3) BBlack: Attempt to fix CA token issue via double-value detection -> delete on .wikidata.org [puppet] - https://gerrit.wikimedia.org/r/231556 (https://phabricator.wikimedia.org/T109038)
[13:56:34] (CR) BBlack: [C: +2] "PS3 variant tested on cp1008, VCL syntax seems sane" [puppet] - https://gerrit.wikimedia.org/r/231556 (https://phabricator.wikimedia.org/T109038) (owner: BBlack)
[13:58:21] (CR) JanZerebecki: "Not enough knowledge about the exact vcl." [puppet] - https://gerrit.wikimedia.org/r/231556 (https://phabricator.wikimedia.org/T109038) (owner: BBlack)
[13:59:43] (CR) JanZerebecki: "Disregard my previous question that only affects the central auth token not the actual session cookie." [puppet] - https://gerrit.wikimedia.org/r/231556 (https://phabricator.wikimedia.org/T109038) (owner: BBlack)
[14:05:26] operations, Database: huge (140GB) decrease on available disk space on db1042 - https://phabricator.wikimedia.org/T108316#1539624 (jcrespo) Open>Resolved db1042 successfully defragmented, now it has a 40% of the disk free and not a huge ibdata1. Could be used for cloning to other servers on the same s...
[14:11:50] (PS1) BBlack: wikidata.org cookie workaround: add comments re task and when it can be removed [puppet] - https://gerrit.wikimedia.org/r/231558 (https://phabricator.wikimedia.org/T109038)
[14:12:22] (CR) BBlack: [C: +2 V: +2] wikidata.org cookie workaround: add comments re task and when it can be removed [puppet] - https://gerrit.wikimedia.org/r/231558 (https://phabricator.wikimedia.org/T109038) (owner: BBlack)
[14:17:23] operations, Database: mariadb multi-source replication glitch with site_identifiers - https://phabricator.wikimedia.org/T106647#1539651 (jcrespo) Importing now dbstore2002, all connections on the server will have lag while this happens.
[14:44:53] operations, Database: db1021 %iowait up - https://phabricator.wikimedia.org/T87277#1539688 (jcrespo) Open>declined The current status of db1021 is Okish. The BBU is not in good state and one disk has errors, but the RAID is functional. There is no actionable but to dismantle the hardware: T106847 an...
[14:59:46] !log rebooting labvirt1003
[14:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:04:51] (PS1) Tim Landscheidt: Tools: Check permissions for error.log in webservice [puppet] - https://gerrit.wikimedia.org/r/231564 (https://phabricator.wikimedia.org/T99576)
[15:05:01] PROBLEM - Host labvirt1003 is DOWN: PING CRITICAL - Packet loss = 100%
[15:05:47] ^ is me, it will be back up shortly
[15:06:21] (PS1) Giuseppe Lavagetto: Fix typo in threads.py [software/puppet-compiler] - https://gerrit.wikimedia.org/r/231565
[15:06:43] (CR) Tim Landscheidt: [C: -1] "Fails when error.log does not exist." [puppet] - https://gerrit.wikimedia.org/r/231564 (https://phabricator.wikimedia.org/T99576) (owner: Tim Landscheidt)
[15:07:48] (CR) Giuseppe Lavagetto: [C: +2 V: +2] Fix typo in threads.py [software/puppet-compiler] - https://gerrit.wikimedia.org/r/231565 (owner: Giuseppe Lavagetto)
[15:07:51] RECOVERY - Host labvirt1003 is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms
[15:08:21] andrewbogott: is horizon already used (much)? mind if we merge https://gerrit.wikimedia.org/r/#/c/230694/1
[15:10:48] (PS2) Andrew Bogott: openstack: make dashboard compatible with Apache 2.4 [puppet] - https://gerrit.wikimedia.org/r/230694 (owner: Dzahn)
[15:10:57] mutante: I'll merge it. I'm surprised it works as is :/
[15:11:48] andrewbogott: the reason it works (if it already is on 2.4) is that there is a special module "mod_access_compat" that translates the 2.2 config to 2.4 config internally
[15:11:59] but if it weren't loaded (by default?) it would break
[15:12:05] ah — ugly!
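(BBlack's merged change above, Gerrit 231556 / T109038, detects a doubled CentralAuth token cookie and deletes it. The VCL below is an illustrative reconstruction from the commit summary, not the production code; the cookie name and regexes are assumptions:)
```
sub vcl_recv {
    # If centralauth_Token appears twice in the Cookie header, strip it
    # entirely so the client falls back to a clean re-authentication.
    if (req.http.Host ~ "wikidata\.org$"
        && req.http.Cookie ~ "centralauth_Token=[^;]+;.*centralauth_Token=") {
        set req.http.Cookie = regsuball(req.http.Cookie,
            "(^|; )centralauth_Token=[^;]*", "");
    }
}
```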
[15:13:41] (CR) Andrew Bogott: [C: +2] openstack: make dashboard compatible with Apache 2.4 [puppet] - https://gerrit.wikimedia.org/r/230694 (owner: Dzahn)
[15:14:00] thx
[15:14:20] (PS1) Cmjohnson: Adding dns entries for new analytics boxes both production and mgmt [dns] - https://gerrit.wikimedia.org/r/231572
[15:15:32] mutante: applied, horizon still looks fine to me.
[15:17:03] (PS1) Giuseppe Lavagetto: Remove the working directory at the end of a successful run [software/puppet-compiler] - https://gerrit.wikimedia.org/r/231573
[15:17:18] (CR) Cmjohnson: [C: +2] Adding dns entries for new analytics boxes both production and mgmt [dns] - https://gerrit.wikimedia.org/r/231572 (owner: Cmjohnson)
[15:20:48] (CR) Merlijn van Deen: [C: +1] Tools: Check permissions for error.log in webservice [puppet] - https://gerrit.wikimedia.org/r/231564 (https://phabricator.wikimedia.org/T99576) (owner: Tim Landscheidt)
[15:22:54] operations, RESTBase, RESTBase-Cassandra: Cassandra internode TLS encryption - https://phabricator.wikimedia.org/T108953#1539824 (BBlack) The latter is what I'd like to do for the client auth and varnish<->varnish parts of T108580 as well, but one of the outstanding issues there is that our puppet key...
[15:24:13] (PS2) Tim Landscheidt: Tools: Check permissions for error.log in webservice [puppet] - https://gerrit.wikimedia.org/r/231564 (https://phabricator.wikimedia.org/T99576)
[15:26:00] (CR) Thcipriani: [C: +2] "Tested on soft errors during some downtime provided by the labvirt1003 reboot. Works as expected." [tools/scap] - https://gerrit.wikimedia.org/r/231442 (https://phabricator.wikimedia.org/T109007) (owner: BryanDavis)
[15:26:24] (Merged) jenkins-bot: Return super().main() when overriding AbstractSync.main() [tools/scap] - https://gerrit.wikimedia.org/r/231442 (https://phabricator.wikimedia.org/T109007) (owner: BryanDavis)
[15:26:41] (CR) Tim Landscheidt: "Tested on tools-login with:" [puppet] - https://gerrit.wikimedia.org/r/231564 (https://phabricator.wikimedia.org/T99576) (owner: Tim Landscheidt)
[15:27:37] (PS1) Milimetric: [WIP] Add an Analytics specific instance of RESTBase [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056)
[15:35:22] andrewbogott: :)
[15:37:44] would there be a problem with including all font packages (mediawiki::packages::fonts) on all appservers?
[15:42:08] mutante: FWIW we ran into the issue in CI that the package names for fonts in the mediawiki module have evidently changed on jessie: https://phabricator.wikimedia.org/T102623
[15:43:11] thcipriani: hmm. yea. i have a pending change for that too, but WIP https://gerrit.wikimedia.org/r/#/c/218640/
[15:44:02] thcipriani: so that needs to be fixed, yes. but separately i ask because of https://phabricator.wikimedia.org/T84777
[15:44:09] "Timelines aren't rendered on image scalers. They're rendered on standard
[15:44:12] mediawiki app servers"
[15:44:16] mutante: no, provided you made sure all fonts are available on both distros
[15:44:21] and apparently we only install the fonts on image scalers
[15:44:37] should be fine, since the image scalers were recently converted
[15:44:49] so if there were any package incompatibilities they would have surfaced
[15:44:54] +1 then
[15:45:17] thank you
[15:45:24] this would add them to non-imagescalers too
[15:45:31] * ori nods
[15:45:32] that's fine
[15:45:37] cool :)
[15:45:40] check with joe as well
[15:45:44] ok
[15:46:24] awwww
[15:48:52] (CR) coren: [C: +1] "More clarity." [puppet] - https://gerrit.wikimedia.org/r/231564 (https://phabricator.wikimedia.org/T99576) (owner: Tim Landscheidt)
[16:10:12] PROBLEM - check_puppetrun on payments1003 is CRITICAL: puppet fail
[16:12:37] (PS3) EBernhardson: Make jq available on deployment servers [puppet] - https://gerrit.wikimedia.org/r/223974
[16:15:12] PROBLEM - check_puppetrun on payments1003 is CRITICAL: puppet fail
[16:18:19] (PS2) Muehlenhoff: Enable ferm rules for role::mariadb::dbstore [puppet] - https://gerrit.wikimedia.org/r/228237 (https://phabricator.wikimedia.org/T104699)
[16:19:40] (CR) Muehlenhoff: [C: +2 V: +2] Enable ferm rules for role::mariadb::dbstore [puppet] - https://gerrit.wikimedia.org/r/228237 (https://phabricator.wikimedia.org/T104699) (owner: Muehlenhoff)
[16:20:12] RECOVERY - check_puppetrun on payments1003 is OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[16:22:50] (PS2) Muehlenhoff: Add ferm rules for role::mariadb::proxy [puppet] - https://gerrit.wikimedia.org/r/228239 (https://phabricator.wikimedia.org/T104699)
[16:23:01] (CR) Muehlenhoff: [C: +2 V: +2] Add ferm rules for role::mariadb::proxy [puppet] - https://gerrit.wikimedia.org/r/228239 (https://phabricator.wikimedia.org/T104699) (owner: Muehlenhoff)
[16:25:24] (PS2) Muehlenhoff: Add ferm rules for role::mariadb::misc [puppet] - https://gerrit.wikimedia.org/r/228783 (https://phabricator.wikimedia.org/T104699)
[16:25:32] (CR) Muehlenhoff: [C: +2 V: +2] Add ferm rules for role::mariadb::misc [puppet] - https://gerrit.wikimedia.org/r/228783 (https://phabricator.wikimedia.org/T104699) (owner: Muehlenhoff)
[16:30:40] operations, CirrusSearch, Discovery: [epic] Update Elasticsearch to 1.6.1 or 1.7.1 - https://phabricator.wikimedia.org/T106090#1540008 (chasemp)
[16:30:42] operations, CirrusSearch, Discovery, Discovery-Cirrus-Sprint, Patch-For-Review: Upgrade production to elasticsearch 1.7.1 - https://phabricator.wikimedia.org/T106165#1540005 (chasemp) Open>Resolved a: chasemp elastic1012 1.7.1 elastic1013 1.7.1 elastic1019 1.7.1 elastic1008 1.7.1 elastic1...
[16:45:17] (PS1) BryanDavis: beta: copy prod bits apache config [puppet] - https://gerrit.wikimedia.org/r/231583
[16:45:24] ostriches: ^
[16:45:41] Heh, I was doing the same thing, beat me to it.
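(mutante's 15:37 proposal amounts to widening where the existing fonts class is applied. A minimal sketch, assuming the mediawiki::packages::fonts class already used on the image scalers; the surrounding role class is illustrative:)
```
# Pull the font packages into every appserver role, not just the image
# scalers, so wikitext <timeline> rendering works everywhere.
class role::mediawiki::appserver {
    include ::mediawiki::packages::fonts
    # ...rest of the role as before...
}
```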
[16:45:47] Lemme cherry-pick to beta and try it
[16:51:20] operations, Collaboration-Team-Backlog, Flow, MediaWiki-Redirects, Reading-Web: Flow url doesn't redirect to mobile - https://phabricator.wikimedia.org/T107108#1540153 (Quiddity)
[16:51:47] (CR) Chad: [C: +1] "*grumble grumble* apache config duplication *grumble grumble*" [puppet] - https://gerrit.wikimedia.org/r/231583 (owner: BryanDavis)
[16:52:41] operations, Collaboration-Team-Backlog, Flow, MediaWiki-Redirects, Reading-Web: On mobile, the Flow notification's link takes you to the desktop version of the Flow page, even though the main (background) link takes you to the mobile one (main) - https://phabricator.wikimedia.org/T107108#1540166 (...
[16:55:37] operations, Collaboration-Team-Backlog, Flow, MediaWiki-Redirects, Reading-Web: On mobile, the Flow notification's link takes you to the desktop version of the Flow page, even though the main (background) link takes you to the mobile one (main) - https://phabricator.wikimedia.org/T107108#1540171 (...
[16:59:54] operations, Deployment-Systems, RESTBase, Services, Patch-For-Review: [Discussion] Move restbase config to Ansible (or $deploy_system in general)? - https://phabricator.wikimedia.org/T107532#1540195 (mmodell) A key quote from the github issue: >"If you run with -vvvv you will see exactly what...
[17:03:36] operations, Deployment-Systems, RESTBase, Services, Patch-For-Review: [Discussion] Move restbase config to Ansible (or $deploy_system in general)? - https://phabricator.wikimedia.org/T107532#1540209 (GWicke) > So it appears that this isn't really fixable and unfortunately detracts from ansible's...
[17:07:11] operations, Discovery, Elasticsearch: Investigate the need for master only (non data nodes) in our ES cluster - https://phabricator.wikimedia.org/T109090#1540229 (Krenair)
[17:07:43] operations, Discovery, Elasticsearch: Investigate tweaking of the "wait for me" parameter for upgrades / restarts - https://phabricator.wikimedia.org/T109091#1540247 (Krenair)
[17:07:44] (PS5) Chad: Move web::sites to web::prod_sites; begin unification in new class [puppet] - https://gerrit.wikimedia.org/r/197655
[17:11:31] Ops-Access-Requests, operations, Graphite: Grant Access to OIT to store time series data in Graphite and Access Graphana Dashboards - https://phabricator.wikimedia.org/T109028#1540280 (JKrauska) Thanks for the quick response. It's much simpler for IT to continue using the hosted solution. We will kee...
[17:17:01] Ops-Access-Requests, operations, Graphite: Grant Access to OIT to store time series data in Graphite and Access Graphana Dashboards - https://phabricator.wikimedia.org/T109028#1540314 (Krenair) Are there any servers owned by WMF that aren't either in production datacenters or run by Office IT in SF? It...
[17:20:55] Ops-Access-Requests, operations, Graphite: Grant Access to OIT to store time series data in Graphite and Access Graphana Dashboards - https://phabricator.wikimedia.org/T109028#1540331 (JKrauska) @Krenair : I'm not aware of other 'WMF' servers. One concern is perhaps access to current stats vs being abl...
[17:27:24] Ops-Access-Requests, operations, Graphite: Grant Access to OIT to store time series data in Graphite and Access Graphana Dashboards - https://phabricator.wikimedia.org/T109028#1540366 (Krenair) >>! In T109028#1540331, @JKrauska wrote: > @faidon: Krenair says 'almost anyone' can have access. :) @faid...
[17:27:36] (CR) Dzahn: [C: -1] "ok, but only after https://gerrit.wikimedia.org/r/#/c/218640/11 is done first" [puppet] - https://gerrit.wikimedia.org/r/231284 (https://phabricator.wikimedia.org/T84777) (owner: Dzahn)
[17:39:02] operations, ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1540416 (RobH)
[17:43:09] operations, ops-codfw: Create shipment for eqord (router and gear) - https://phabricator.wikimedia.org/T109109#1540442 (RobH) NEW a: Papaul
[17:44:00] operations, ops-codfw: Create shipment for eqord (router and gear) - https://phabricator.wikimedia.org/T109109#1540442 (RobH) IRC Update: Papaul will confirm with Faidon that he can unplug the router. We think he can, but since the shipment won't go out until next week, it doesn't hurt to wait and check.
[17:46:38] operations, ops-codfw: Create shipment for eqord (router and gear) - https://phabricator.wikimedia.org/T109109#1540475 (Papaul) The above list is complete just need to box the items.
[17:50:44] (CR) Ottomata: "Quick comment on structure, as I am not very familiar with cassandra or restbase configuration." [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: Milimetric)
[17:59:52] !log deployed latest kartotherian
[17:59:55] csteipp, ^
[17:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:00:22] operations, ops-codfw: Create shipment for eqord (router and gear) - https://phabricator.wikimedia.org/T109109#1540553 (RobH) IRC UPDATE: This CANNOT be unplugged, as right now one of the eqiad-codfw links is via cr1-eqord. @Papaul: You'll need to coordinate with @faidon to migrate this link back to the...
[18:06:50] operations, MediaWiki-extensions-CentralAuth, Database: Special:GlobalUsers varies between claiming a user is or isn't attached - https://phabricator.wikimedia.org/T102915#1540564 (Glaisher) 1. Original query when no username is specified: ```lang=sql SELECT gu_id,gu_name,gu_locked,lu_attached_method,C...
[18:07:12] operations: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1540566 (ALantz) @qgil Joady and I are (and have been) working with IT on offboarding. We have several processes and checklists in place. Emailing me about it would be the best way to continue with HR,...
[18:08:51] operations, Traffic, fundraising-tech-ops, Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1540574 (CCogdill_WMF) After talking about this more with our email consultants, I don't think we have any other option but to chang...
[18:15:12] operations, ops-ulsfo: RIPE Atlas Anchor @ ulsfo is down - https://phabricator.wikimedia.org/T107691#1540586 (RobH) I'll investigate this next week, as I'll be onsite to also patch in our xconnect from telia.
[18:16:41] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[18:24:31] !log Repooling mw1041 now that T108601 is resolved.
[18:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:27:48] !log ori@tin Synchronized php-1.26wmf18/extensions/MultimediaViewer: 9ee0437bc6: Updated mediawiki/core Project: mediawiki/extensions/MultimediaViewer 645b6c9e93fae13e09e5b493547aecc5a2e933ae (duration: 00m 12s)
[18:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:28:33] operations: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1540613 (ALantz) @qgil @jkrauska After talking with Joady about this: Could IT include Ops with their offboarding form results? Ops has previously requested access to the 2 HR internal tracking docs,...
[18:35:10] (CR) MaxSem: Added tilerator service, granted kartotherian OSM DB read access (1 comment) [puppet] - https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) (owner: Yurik)
[18:36:19] operations, hardware-requests: dbproxy servers for codfw - https://phabricator.wikimedia.org/T109116#1540647 (RobH) NEW a: jcrespo
[18:36:29] akosiaris, i explained why we need SQL access in kartotherian in an email
[18:38:46] (CR) Alexandros Kosiaris: Tilerator: start ncpu / 2 workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/231427 (https://phabricator.wikimedia.org/T108974) (owner: Mobrovac)
[18:39:45] operations, Discovery, Wikimedia-Logstash, Elasticsearch, Graphite: Deploy statsd plugin for production elasticsearch & logstash - https://phabricator.wikimedia.org/T90889#1540689 (chasemp) What statsd plugin is this? I do know diamond has a native elasticsearch poller as well.
[18:44:20] (PS1) BBlack: Remove mobile.wp.o [dns] - https://gerrit.wikimedia.org/r/231610 (https://phabricator.wikimedia.org/T104942)
[18:44:58] (PS2) Milimetric: [WIP] Add an Analytics specific instance of RESTBase [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056)
[18:45:18] (PS2) BBlack: mobile-frontend: remove support for dead wap/mobile subdomains [puppet] - https://gerrit.wikimedia.org/r/231298
[18:46:16] (CR) BBlack: [C: +2] mobile-frontend: remove support for dead wap/mobile subdomains [puppet] - https://gerrit.wikimedia.org/r/231298 (owner: BBlack)
[18:46:22] (PS2) BBlack: mobile-frontend: remove dead 666-redirect handler [puppet] - https://gerrit.wikimedia.org/r/231299
[18:47:05] (CR) BBlack: [C: +2] Remove mobile.wp.o [dns] - https://gerrit.wikimedia.org/r/231610 (https://phabricator.wikimedia.org/T104942) (owner: BBlack)
[18:47:15] (CR) BBlack: [C: +2] mobile-frontend: remove dead 666-redirect handler [puppet] - https://gerrit.wikimedia.org/r/231299 (owner: BBlack)
[18:49:17] (CR) Yurik: Tilerator: start ncpu / 2 workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/231427 (https://phabricator.wikimedia.org/T108974) (owner: Mobrovac)
[18:51:11] (PS1) BBlack: Remove wap/mobile subdomains from redirects [puppet] - https://gerrit.wikimedia.org/r/231612 (https://phabricator.wikimedia.org/T104942)
[18:52:39] (CR) BBlack: [C: +2] Remove wap/mobile subdomains from redirects [puppet] - https://gerrit.wikimedia.org/r/231612 (https://phabricator.wikimedia.org/T104942) (owner: BBlack)
[18:52:43] (CR) Alexandros Kosiaris: "Hello. Disclaimer: Not sure at all about the architecture of all this as I have" [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: Milimetric)
[18:57:14] (CR) Alexandros Kosiaris: Tilerator: start ncpu / 2 workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/231427 (https://phabricator.wikimedia.org/T108974) (owner: Mobrovac)
[18:59:45] (PS3) Milimetric: [WIP] Add an Analytics specific instance of RESTBase [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056)
[19:00:09] akosiaris: thanks for pointing out that patch didn't have an explanation, I updated the commit message, let me know if anything is still unclear
[19:00:44] in short, the current approach is my (potentially bad) interpretation of Gabriel and Marko's guidance
[19:10:56] (PS1) Tim Landscheidt: ldap: Update ldaplist to new hosts structure [puppet] - https://gerrit.wikimedia.org/r/231616
[19:14:31] operations, Discovery, Elasticsearch: Investigate the need for master only (non data nodes) in our ES cluster - https://phabricator.wikimedia.org/T109090#1540827 (chasemp)
[19:15:51] operations, Discovery, Elasticsearch: Cultivating the Elasticsearch garden - https://phabricator.wikimedia.org/T109089#1540830 (chasemp)
[19:25:05] operations: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1540865 (JKrauska) @alantz I believe it would be most efficient for Ops to have a direct process with HR instead of needing to go through IT to get this critical information.
[19:29:01] !log ori@tin Synchronized php-1.26wmf18/includes/resourceloader/ResourceLoader.php: f72009a543: ResourceLoader: apply minify-js filter to config scripts (duration: 00m 13s)
[19:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:39:46] operations, Discovery, Elasticsearch: Investigate mysterious write load during general read-only maintenance - https://phabricator.wikimedia.org/T109127#1540887 (chasemp) NEW
[19:40:16] operations: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1540893 (ALantz) @jkrauska We are happy to give access to our 2 tracking forms to those in Ops that need this information. Just email Joady if there's anyone that needs to be added for this process. A...
[19:46:09] operations, Discovery, Elasticsearch: Investigate mysterious write load during general read-only maintenance - https://phabricator.wikimedia.org/T109127#1540904 (chasemp)
[19:49:58] operations, Discovery, Elasticsearch: Investigate the need for master only (non data nodes) in our ES cluster - https://phabricator.wikimedia.org/T109090#1540931 (chasemp)
[19:55:12] operations, Discovery, Elasticsearch: Cultivating the Elasticsearch garden (Lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#1540952 (chasemp)
[19:55:16] operations: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1540953 (JKrauska) @ALantz From my perspective we have different teams each wanting to stick to their own process flows and not use each others tools.. :) Ops would like HR to use Phabricator. HR would...
[19:57:22] operations, Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1540961 (Aklapper)
[19:57:37] operations, Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1540966 (Aklapper) No idea who can act on this, CC'ing Operations (feel free to remove again)
[20:02:28] I have one ssh key for production access (mainly stat1003). Is it safe to use the same key for Gerrit access?
[20:10:19] operations, CirrusSearch, Discovery: [epic] Update Elasticsearch to 1.6.1 or 1.7.1 - https://phabricator.wikimedia.org/T106090#1541019 (Aklapper) All "Blocked by" tickets closed; what's left here?
[20:11:25] operations, CirrusSearch, Discovery: [epic] Update Elasticsearch to 1.6.1 or 1.7.1 - https://phabricator.wikimedia.org/T106090#1541029 (Deskana) Open>Resolved a: Deskana Thanks @Aklapper. As far as I know, this is resolved. @EBernhardson can correct me if I'm wrong.
[20:11:34] operations, CirrusSearch, Discovery, Epic: [epic] Update Elasticsearch to 1.6.1 or 1.7.1 - https://phabricator.wikimedia.org/T106090#1541032 (Deskana)
[20:11:53] (CR) Andrew Bogott: [C: +2] "This won't last, since I keep messing with the host ldap schema. But, for now..." [puppet] - https://gerrit.wikimedia.org/r/231616 (owner: Tim Landscheidt)
[20:11:58] (PS2) Andrew Bogott: ldap: Update ldaplist to new hosts structure [puppet] - https://gerrit.wikimedia.org/r/231616 (owner: Tim Landscheidt)
[20:24:07] operations, WMF-Legal, Wikimedia-General-or-Unknown, Documentation, Software-Licensing: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#1541095 (Ricordisamoa)
[20:26:22] PROBLEM - Disk space on uranium is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=78%)
[20:28:29] operations, Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1541120 (Krenair) Yep, mail aliases are kept in puppet-private so you need to add #operations to get these requests processed.
[20:52:22] (PS1) GWicke: Enable /api/ rewrites for mediawiki.org, meta, wikisource & commons [puppet] - https://gerrit.wikimedia.org/r/231689
[20:53:27] (PS2) GWicke: Enable /api/ rewrites for mediawiki.org, meta, wikisource & commons [puppet] - https://gerrit.wikimedia.org/r/231689
[20:54:27] Does that really cover everything?
[20:55:36] Krenair: do you mean the /api/ rewrite?
[20:55:39] also, gwicke, you should probably clarify that this is sourceswiki only, not *wikisource
[20:56:10] hmm, yeah
[20:56:47] yes
[20:57:15] it does have the rest api, so should also have the listing: https://wikisource.org/api/rest_v1/
[20:57:21] gwicke, I don't think you've covered everything with this.
[20:58:15] that's very possible; I added the projects I know about as missing which have an actual /api/ set up
[20:59:06] outreach is another one which springs to mind as a special wiki that should be in restbase but returns 404 to /api/
[20:59:33] Along with many others: https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings.php#L1315-L1375
[21:00:06] some of which are private so can't have restbase, but some are not
[21:02:34] yeah, but having the rewrite wouldn't hurt
[21:02:41] (CR) Alex Monk: [C: -1] "Not convinced this covers everything." [puppet] - https://gerrit.wikimedia.org/r/231689 (owner: GWicke)
[21:02:53] I can see if I can pull this into wikimedia-common.incl
[21:03:44] Looks like that's where it should be.
[21:05:34] neilpquinn> I have one ssh key for production access (mainly stat1003). Is it safe to use the same key for Gerrit access? [21:05:42] Is this the right place to ask? [21:07:59] neilpquinn: no, production access and gerrit/wikitech/labs should use separate keys [21:08:01] (03PS3) 10GWicke: Enable /api/ rewrites for mediawiki.org, meta, wikisource & commons [puppet] - 10https://gerrit.wikimedia.org/r/231689 [21:08:11] legoktm, okay thanks! [21:08:51] Krenair: {{done}} [21:11:32] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1431 bytes in 0.151 second response time [21:14:38] hoo: ^ [21:16:36] labs and production key should be different, gerrit key is a third thing and could be either. ttbomk [21:17:20] on it [21:17:27] great [21:17:28] seems more serious [21:17:31] :/ [21:17:36] jzerebecki: [21:19:58] Huh. [21:20:01] Have there been any recent network changes? [21:20:08] commons.wikimedia.org does not include wikimedia-common.incl [21:20:27] Which is supposed to be for *.wikimedia.org. [21:20:30] Krenair: yeah, most large wikis don't [21:20:42] PROBLEM - Host mw2031 is DOWN: PING CRITICAL - Packet loss = 100% [21:21:11] Network in eqiad to db1058 seems super slow [21:21:18] no packet loss AFAICT, though [21:21:40] s5 master? [21:21:44] yeah [21:21:58] ok, there is packet loss :( [21:22:01] RECOVERY - Host mw2031 is UP: PING OK - Packet loss = 0%, RTA = 37.67 ms [21:23:05] Only terbium struggles [21:23:22] 6operations, 10Wikimedia-Mailing-lists: Rename usergroups@ to usergroup-applications@ - https://phabricator.wikimedia.org/T108099#1541320 (10RobH) a:3RobH [21:24:20] Back to normal again [21:24:39] seems something just hammered the ethernet on that machine [21:24:42] nvm [21:25:54] see also https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=terbium.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1439587416&g=network_report&z=large&c=Miscellaneous%20eqiad [21:26:26] !log deployed job runner 808d1ae08d40 [21:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:27:55] (03CR) 10Ori.livneh: [C: 032] Enable /api/ rewrites for mediawiki.org, meta, wikisource & commons [puppet] - 10https://gerrit.wikimedia.org/r/231689 (owner: 10GWicke) [21:28:23] probably should have updated the title to show it's much more than just those sites [21:28:25] too late now [21:28:54] * gwicke crosses fingers [21:29:03] probably no harm in setting up /api/ on private *.wikimedia.org wikis even though it won't work yet [21:29:18] if you already did *.wikipedia.org it'll already include private wikis there too [21:29:21] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1411 bytes in 0.113 second response time [21:30:11] Krenair: we'll likely merge https://github.com/wikimedia/restbase/pull/272 soonish, which adds simple private wiki support [21:33:31] Krenair: thanks for reviewing! [21:37:46] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1541369 (10RobH) This notice has been emailed to the list owners mailing list, as well as posted on tasks T108099 & T107445. I've scheduled a sodium/mailman downtime wind...
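The separate-keys advice given to neilpquinn above usually comes down to one key pair per trust domain, pinned to its destination in ~/.ssh/config. A minimal sketch, assuming ed25519 keys and the standard Gerrit SSH port; the exact Host patterns are illustrative:

    # Generate distinct key pairs instead of reusing the production key.
    ssh-keygen -t ed25519 -f ~/.ssh/id_gerrit -C 'gerrit-only key'
    ssh-keygen -t ed25519 -f ~/.ssh/id_prod -C 'production-only key'
    # Then pin each key to its destination in ~/.ssh/config, e.g.:
    #   Host gerrit.wikimedia.org
    #       Port 29418
    #       IdentityFile ~/.ssh/id_gerrit
    #   Host stat1003.eqiad.wmnet
    #       IdentityFile ~/.ssh/id_prod

With that in place, a compromised Gerrit key does not expose production, which is the point of the advice.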
[21:37:50] 6operations, 10Wikimedia-Mailing-lists: Rename usergroups@ to usergroup-applications@ - https://phabricator.wikimedia.org/T108099#1512290 (10RobH) This notice has been emailed to the list owners mailing list, as well as posted on tasks T108099 & T107445. I've scheduled a sodium/mailman downtime window for pla... [21:45:32] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [21:49:31] PROBLEM - check_apache2 on payments2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [21:51:22] Can someone kick nutcracker on mw1010? [21:51:42] It's eating a lot of CPU and there seem to be problems with memcached [21:52:32] yeah it's busted [21:53:44] chasemp, mutante: restart nutcracker on mw1010 please, 1.6M errors in the last hour from it [21:53:58] ok [21:54:10] Tx mutante [21:54:12] I have the technology to alert on this now! Time to set that up [21:54:14] !log restarted nutcracker on mw1010 [21:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:54:31] PROBLEM - check_apache2 on payments2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [21:54:31] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 1 failures [21:55:06] bd808, memcache or redistribute errors? Or neither [21:56:01] chasemp: graphite counts of error rates that we can alert on. See https://grafana.wikimedia.org/#/dashboard/db/bd808test [21:56:32] Thanks [21:56:49] it popped into the red pretty hard, <100/m to >20K/m [21:57:32] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1541433 (10Tgr) I'll just upload the correct files then: {F1496921} {F1496922} [21:59:31] PROBLEM - check_apache2 on payments2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [21:59:31] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 1 failures [22:01:03] bd808, mutante, chasemp: next time it happens, don't restart nutcracker; depool the server and set it aside so we can inspect it [22:01:07] we need to get down to the bottom of this [22:01:20] *nod* that would be nice [22:01:25] Good point [22:01:57] This is the 2nd or 3rd time in as many days I think [22:04:31] PROBLEM - check_apache2 on payments2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [22:04:31] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 1 failures [22:04:56] (03PS1) 10Ori.livneh: Backport of D40473: Port strtr from zend 5.6.10 (ebernhardson) [debs/hhvm] - 10https://gerrit.wikimedia.org/r/231698 [22:09:31] PROBLEM - check_apache2 on payments2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [22:09:31] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 1 failures [22:14:31] RECOVERY - check_apache2 on payments2002 is OK: PROCS OK: 6 processes with command name apache2 [22:14:31] RECOVERY - check_puppetrun on payments2002 is OK Puppet is currently enabled, last run 151 seconds ago with 0 failures [22:36:34] (03PS1) 10BryanDavis: Add icinga alert for anomalous logstash.rate.mediawiki.memcached.ERROR.count [puppet] - 10https://gerrit.wikimedia.org/r/231704 (https://phabricator.wikimedia.org/T100735) [22:37:53] (03PS2) 10BryanDavis: Add icinga alert for anomalous logstash.rate.mediawiki.memcached.ERROR.count [puppet] - 10https://gerrit.wikimedia.org/r/231704
(https://phabricator.wikimedia.org/T69817) [22:40:10] bd808: any reason not to merge? [22:40:13] it lgtm [22:41:04] No reason that I know of. I can send an ops-l email explaining what to do if it goes off [22:41:41] cool [22:41:54] (03CR) 10Ori.livneh: [C: 032] "Nice work" [puppet] - 10https://gerrit.wikimedia.org/r/231704 (https://phabricator.wikimedia.org/T69817) (owner: 10BryanDavis) [22:43:31] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [22:44:04] 6operations, 10Wikimedia-Mailing-lists: Rename Advocacy_Advisors@ to publicpolicy@ - https://phabricator.wikimedia.org/T109142#1541608 (10Krenair) [22:44:04] yurik, ebernhardson SMalyshev: https://github.com/MaxSem/PoiMap2/commit/6d5684bcdcb42eb212b18d607518a7fefb38cf96 [22:45:20] MaxSem: looks ok to me [22:47:41] bblack, could you re-enable maps pls [22:47:46] security review finished :) [22:52:09] 6operations, 6Discovery, 10Wikimedia-Logstash, 7Elasticsearch, 7Graphite: Deploy statsd plugin for production elasticsearch & logstash - https://phabricator.wikimedia.org/T90889#1541643 (10EBernhardson) This is https://github.com/ebernhardson/elasticsearch-statsd-plugin/tree/v0.3.3-wmf2 which is a forked... [22:52:32] yurik: where is that? [22:52:58] bblack, https://phabricator.wikimedia.org/T105051 [22:54:03] yurik: what about https://phabricator.wikimedia.org/T105090 ? [22:54:27] 6operations, 10CirrusSearch, 6Discovery, 7Epic: [epic] Update Elasticsearch to 1.6.1 or 1.7.1 - https://phabricator.wikimedia.org/T106090#1541654 (10EBernhardson) everything here is done, afaik [22:54:52] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 13.33% of data above the critical threshold [500.0] [22:55:43] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 1 below the confidence bounds [23:00:52] PROBLEM - puppet last run on mw1045 is CRITICAL Puppet has 1 failures [23:02:22] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1018 is CRITICAL 20.00% of data above the critical threshold [10.0] [23:02:35] am here and on it. [23:02:38] tfinc, ^^^^^^ [23:02:51] yurik: hmm ? [23:02:52] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1022 is CRITICAL 20.00% of data above the critical threshold [10.0] [23:03:02] tfinc, re bblack comment on https://phabricator.wikimedia.org/T105090 [23:03:31] yurik: that's the legal ticket [23:04:00] tfinc, is that a blocker to enable it in production? [23:05:24] yurik: let me message stephen to see where he is [23:05:52] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1021 is CRITICAL 25.00% of data above the critical threshold [10.0] [23:07:12] 6operations, 10Wikimedia-Mailing-lists: Rename Advocacy_Advisors@ to publicpolicy@ - https://phabricator.wikimedia.org/T109142#1541736 (10JohnLewis) a:3RobH [23:08:12] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1018 is OK Less than 1.00% above the threshold [1.0] [23:08:43] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1022 is OK Less than 1.00% above the threshold [1.0] [23:09:42] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1021 is OK Less than 1.00% above the threshold [1.0] [23:13:22] yurik: steven will be commenting shortly. he doesn't see any legal issues [23:13:29] bblack, ^ [23:13:33] yeppii! :)
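bd808's alert merged above effectively watches a single Graphite series for an abnormal error rate. A rough hand-rolled equivalent, using Graphite's render API; only the metric name comes from the patch, the 10-minute window and static threshold are made up for illustration, and the real check is anomaly-based rather than a fixed cutoff:

    # Fetch recent datapoints and apply a crude static threshold.
    metric='logstash.rate.mediawiki.memcached.ERROR.count'
    curl -s "https://graphite.wikimedia.org/render?target=${metric}&from=-10min&format=json" \
      | python -c 'import json,sys; pts=json.load(sys.stdin)[0]["datapoints"]; vals=[v for v,t in pts if v is not None]; print("CRITICAL" if vals and max(vals)>1000 else "OK")'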
[23:13:58] i'm asking him to add to the phab ticket so that we do our due diligence [23:14:57] 6operations, 10Wikimedia-Mailing-lists: Rename Advocacy_Advisors@ to publicpolicy@ - https://phabricator.wikimedia.org/T109142#1541768 (10RobH) I'll add it to the list & attempt to get to it. I know I can handle 2 renames (and the associated rebuilds of archives) within the hour. If I run out of time, this o... [23:19:03] (03CR) 10Ori.livneh: [C: 032 V: 032] "Looks great on Labs" [debs/hhvm] - 10https://gerrit.wikimedia.org/r/231698 (owner: 10Ori.livneh) [23:23:52] RECOVERY - puppet last run on mw1045 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [23:25:59] 6operations, 3Discovery-Maps-Sprint: git deploy shows 5 tilerator instances instead of 4 - https://phabricator.wikimedia.org/T108956#1541822 (10Yurik) 5Open>3Resolved a:3Yurik [23:28:04] yurik: https://phabricator.wikimedia.org/T105090 :) [23:28:38] tfinc, excellente, closing it then [23:28:47] bblack, legal is closed :) [23:28:52] (03PS1) 10Ori.livneh: Remove hhvm-fss package from production [puppet] - 10https://gerrit.wikimedia.org/r/231712 (https://phabricator.wikimedia.org/T101418) [23:29:09] (03PS2) 10Ori.livneh: Remove hhvm-fss package from production [puppet] - 10https://gerrit.wikimedia.org/r/231712 (https://phabricator.wikimedia.org/T101418) [23:29:19] (03CR) 10Ori.livneh: [C: 032 V: 032] Remove hhvm-fss package from production [puppet] - 10https://gerrit.wikimedia.org/r/231712 (https://phabricator.wikimedia.org/T101418) (owner: 10Ori.livneh) [23:29:34] zomg [23:34:01] (03PS1) 10Ori.livneh: hhvm: Don't load fss.so [puppet] - 10https://gerrit.wikimedia.org/r/231714 [23:34:10] yurik: ok so we're clear to turn this back on. but I'd just like to be clear: from phab's apparent perspective, with the closing of T105076 that was the last blocker for "Deploy Maps service to production (Q1)" [23:34:14] (03PS2) 10Ori.livneh: hhvm: Don't load fss.so [puppet] - 10https://gerrit.wikimedia.org/r/231714 [23:34:20] (03CR) 10Ori.livneh: [C: 032 V: 032] hhvm: Don't load fss.so [puppet] - 10https://gerrit.wikimedia.org/r/231714 (owner: 10Ori.livneh) [23:34:51] it's really not production-ready in any scaling/perf sense: it's running on leftover hardware in the wrong places, it's a hack. I don't want us to go start turning on any usage of it on the main wikis in this state, IMHO. [23:35:02] the point of the production entrypoint is to test/demo it [23:35:17] tfinc, ^ [23:35:46] (since apparently we couldn't really do that in labs, so we're using prod test/leftover hardware) [23:36:24] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [23:36:42] bblack, at some point we might want to dedicate more servers once we start showing it. Also, in case we do decide that it is production worthy, we might switch two of the servers to be varnishes in codfw, thus satisfying the cross-cluster req [23:37:13] I don't know what kind of distinction or semantics there should be there about the difference between "production: the place it's running at" and "production: the place it can be used from and advertised at" :) [23:37:32] but there needs to be a real distinction there in practice, and we do need to revisit where and what hardware we're deploying on before the latter.
[23:37:35] well, no use in any production wiki, for one [23:37:52] well yeah [23:38:06] greg-g, not exactly - because we use labs in production wiki, which is by far worse than this [23:38:14] PROBLEM - Apache HTTP on mw1255 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.017 second response time [23:38:20] I just mean, I don't know what the right terminology is for drawing the distinction between "production: the place it's running at" and "production: the place it can be used from and advertised at" :) [23:38:22] yurik: define "we" [23:38:33] sorry, community [23:38:45] WMF does not use WMF Labs for anything production [23:38:49] all those maps popups in en and other wikis come directly from labs [23:39:07] clicking on geo coords takes you to labs geohack page [23:39:09] that might be bad, but it's orthogonal to all of this [23:39:10] and many other things [23:39:15] PROBLEM - Apache HTTP on mw1113 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.507 second response time [23:39:15] PROBLEM - HHVM rendering on mw1144 is CRITICAL - Socket timeout after 10 seconds [23:39:15] PROBLEM - HHVM rendering on mw1052 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.125 second response time [23:39:16] so, my statement still stands, no use of a pre-prod service in prod. period. [23:39:23] ^ agreed [23:39:23] PROBLEM - Apache HTTP on mw1123 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 1.445 second response time [23:39:23] PROBLEM - HHVM rendering on mw1123 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 2.139 second response time [23:39:24] PROBLEM - Apache HTTP on mw1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time [23:39:33] argh [23:39:34] PROBLEM - Apache HTTP on mw1207 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.014 second response time [23:39:38] wtf [23:39:53] ori: known issue? fss.so? [23:39:58] could be, looking [23:40:01] k [23:40:24] PROBLEM - HHVM rendering on mw1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.032 second response time [23:40:53] PROBLEM - Disk space on mw1114 is CRITICAL: DISK CRITICAL - free space: / 6711 MB (3% inode=93%) [23:41:04] PROBLEM - puppet last run on analytics1012 is CRITICAL puppet fail [23:41:13] RECOVERY - Apache HTTP on mw1113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.049 second response time [23:41:13] RECOVERY - HHVM rendering on mw1052 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.421 second response time [23:41:39] yurik: putting this in production means a firm commitment to making it scale and be reliable for real use, and we basically have no time or money budget on that AFAIK, and we're using whatever hardware we could scrap in a suboptimal way. the plan forward from there isn't "slowly steal or re-arrange more hacks/hardware as load grows when we start using it" - it's: have a real plan for a real de [23:41:43] PROBLEM - Apache HTTP on mw1037 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.007 second response time [23:41:45] ployment before we start using it.
[23:42:23] PROBLEM - HHVM rendering on mw1113 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 2.975 second response time [23:42:30] we may be able to find ways to do that relatively cheaply, but it's a discussion that has to happen first, puppet changes and deployments that have to happen, and a further decision point before real prod usage [23:42:34] PROBLEM - HHVM rendering on mw1237 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.008 second response time [23:42:34] bblack, money is unfortunately out of my hands, tfinc? [23:42:48] ugh [23:42:50] there's a bug in /srv/mediawiki/php-1.26wmf18/includes/libs/ReplacementArray.php [23:42:51] fixing [23:42:54] PROBLEM - HHVM rendering on mw1207 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.028 second response time [23:42:54] PROBLEM - HHVM rendering on mw1037 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 2.652 second response time [23:43:06] well we can take that offline. my point is, flipping maps.wm.o back on is not a license to start ramping in any real usage of it. we need a separate conversation about that. [23:43:07] getting 503 errors on en.wiki [23:43:07] bblack: would you say it's enough for our community to test, experiment, and iterate on? [23:43:23] PROBLEM - Apache HTTP on mw1231 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.007 second response time [23:43:23] PROBLEM - HHVM rendering on mw1188 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.014 second response time [23:43:24] PROBLEM - HHVM rendering on mw1100 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.010 second response time [23:43:24] PROBLEM - HHVM rendering on mw1177 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.023 second response time [23:43:24] PROBLEM - Apache HTTP on mw1256 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.017 second response time [23:43:24] PROBLEM - Apache HTTP on mw1234 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.472 second response time [23:43:24] PROBLEM - Apache HTTP on mw1252 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.367 second response time [23:43:25] PROBLEM - HHVM rendering on mw1246 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.007 second response time [23:43:25] PROBLEM - Apache HTTP on mw1134 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 2.828 second response time [23:43:26] PROBLEM - Apache HTTP on mw1209 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.022 second response time [23:43:26] PROBLEM - Apache HTTP on mw1164 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [23:43:27] PROBLEM - Apache HTTP on mw1213 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.013 second response time [23:43:33] PROBLEM - HHVM rendering on mw1057 is CRITICAL - Socket timeout after 10 seconds [23:43:33] PROBLEM - Apache HTTP on mw1128 is CRITICAL - Socket timeout after 10 seconds [23:43:33] PROBLEM - HHVM rendering on mw1142 is CRITICAL - Socket timeout after 10 seconds [23:43:33] PROBLEM - HHVM rendering on mw1147 is CRITICAL - Socket timeout after 10 seconds [23:43:34] PROBLEM - HHVM rendering on mw1072 is CRITICAL -
Socket timeout after 10 seconds [23:43:34] PROBLEM - Apache HTTP on mw1181 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.029 second response time [23:43:34] PROBLEM - Apache HTTP on mw1129 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.012 second response time [23:43:35] PROBLEM - Apache HTTP on mw1175 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [23:43:35] PROBLEM - Apache HTTP on mw1162 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.018 second response time [23:43:35] each new service deployment incurs a real (and immensely non-zero) cost in operations and maint etc. If there isn't the budget for it long term, we shouldn't reduce other people's budgets (ops, in time and maint) just to test [23:43:36] PROBLEM - Apache HTTP on mw1140 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 7.033 second response time [23:43:47] PROBLEM - HHVM rendering on mw1131 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.295 second response time [23:43:47] PROBLEM - Apache HTTP on mw1150 is CRITICAL - Socket timeout after 10 seconds [23:43:48] PROBLEM - HHVM rendering on mw1110 is CRITICAL - Socket timeout after 10 seconds [23:43:48] PROBLEM - Apache HTTP on mw1046 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.012 second response time [23:43:49] PROBLEM - Apache HTTP on mw1218 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.009 second response time [23:43:49] PROBLEM - Apache HTTP on mw1208 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.016 second response time [23:43:50] PROBLEM - Apache HTTP on mw1201 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.036 second response time [23:43:50] PROBLEM - Apache HTTP on mw1171 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [23:43:51] PROBLEM - HHVM rendering on mw1050 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 3.958 second response time [23:43:51] PROBLEM - HHVM rendering on mw1240 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [23:43:52] PROBLEM - HHVM rendering on mw1239 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.097 second response time [23:43:52] PROBLEM - Apache HTTP on mw1167 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.020 second response time [23:43:53] PROBLEM - HHVM rendering on mw1185 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.019 second response time [23:43:53] PROBLEM - Apache HTTP on mw1049 is CRITICAL - Socket timeout after 10 seconds [23:43:54] PROBLEM - Apache HTTP on mw1096 is CRITICAL - Socket timeout after 10 seconds [23:43:54] PROBLEM - HHVM rendering on mw1178 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 1.132 second response time [23:43:55] PROBLEM - HHVM rendering on mw1208 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.029 second response time [23:43:55] PROBLEM - Apache HTTP on mw1088 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 9.958 second response time [23:44:06] ori: revert revert?
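The timing here lines up with the fss.so removal merged minutes earlier: php-1.26wmf18's ReplacementArray.php has a fast path that uses the FSS extension's functions, which disappear once HHVM stops loading fss.so. A quick, illustrative way to confirm what a given app server sees (fss_prep_replace is a function from the FSS extension; the scratch file path is arbitrary):

    # Check whether this HHVM still provides the FSS functions.
    echo '<?php var_dump(function_exists("fss_prep_replace"));' > /tmp/fss-check.php
    hhvm /tmp/fss-check.php    # bool(false) once fss.so is no longer loaded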
[23:44:08] PROBLEM - HHVM rendering on mw1227 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [23:44:08] PROBLEM - HHVM rendering on mw1256 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.003 second response time [23:44:09] PROBLEM - Apache HTTP on mw1035 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 3.015 second response time [23:44:09] PROBLEM - Apache HTTP on mw1109 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 3.008 second response time [23:44:09] yeah [23:44:10] PROBLEM - Apache HTTP on mw1236 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.007 second response time [23:44:10] PROBLEM - Apache HTTP on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 3.636 second response time [23:44:11] PROBLEM - HHVM rendering on mw1122 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 3.635 second response time [23:44:11] PROBLEM - Apache HTTP on mw1230 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.680 second response time [23:44:12] PROBLEM - Apache HTTP on mw1173 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.121 second response time [23:44:12] PROBLEM - Apache HTTP on mw1148 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 5.383 second response time [23:44:13] PROBLEM - Apache HTTP on mw1022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [23:44:13] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 50.00% of data above the critical threshold [500.0] [23:44:14] PROBLEM - HHVM rendering on mw1145 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 7.021 second response time [23:44:14] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:44:15] PROBLEM - Apache HTTP on mw1100 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 5.126 second response time [23:44:15] PROBLEM - Apache HTTP on mw1194 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.007 second response time [23:44:16] PROBLEM - HHVM rendering on mw1236 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.014 second response time [23:44:16] PROBLEM - HHVM rendering on mw1245 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time [23:44:28] PROBLEM - HHVM rendering on mw1232 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.015 second response time [23:44:28] PROBLEM - HHVM rendering on mw1180 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.009 second response time [23:44:29] PROBLEM - HHVM rendering on mw1138 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.134 second response time [23:44:29] PROBLEM - Apache HTTP on mw1107 is CRITICAL - Socket timeout after 10 seconds [23:44:30] PROBLEM - HHVM rendering on mw1027 is CRITICAL - Socket timeout after 10 seconds [23:44:30] PROBLEM - Apache HTTP on mw1091 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 3.008 second response time [23:44:31] PROBLEM - Apache HTTP on mw1145 is CRITICAL - Socket timeout after 10 seconds [23:44:31] PROBLEM - Apache HTTP on mw1085 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 8.346 second response time [23:44:32] PROBLEM - HHVM rendering on mw1150 is CRITICAL - Socket timeout after 10 seconds [23:44:32] PROBLEM - HHVM rendering on mw1248 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.009 second response time [23:44:33] PROBLEM - Apache HTTP on mw1033 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.008 second response time [23:44:33] PROBLEM - HHVM rendering on mw1047 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.003 second response time [23:44:34] PROBLEM - HHVM rendering on mw1029 is CRITICAL - Socket timeout after 10 seconds [23:44:34] PROBLEM - Apache HTTP on mw1092 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [23:45:05] this is a full outage at this point. 
load.php is gone so even cached pages are unstyled [23:45:55] RECOVERY - Apache HTTP on mw1049 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.170 second response time [23:45:55] PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.014 second response time [23:45:56] PROBLEM - Apache HTTP on mw1074 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time [23:45:56] RECOVERY - HHVM rendering on mw1185 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.233 second response time [23:45:57] PROBLEM - HHVM rendering on mw1206 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time [23:45:57] PROBLEM - HHVM rendering on mw1071 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.372 second response time [23:45:58] PROBLEM - Apache HTTP on mw1105 is CRITICAL - Socket timeout after 10 seconds [23:45:58] PROBLEM - HHVM rendering on mw1171 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 7.776 second response time [23:46:03] PROBLEM - Apache HTTP on mw1099 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.018 second response time [23:46:03] PROBLEM - HHVM rendering on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.019 second response time [23:46:03] RECOVERY - HHVM rendering on mw1178 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.900 second response time [23:46:04] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.080 second response time [23:46:04] PROBLEM - HHVM rendering on mw1205 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.008 second response time [23:46:04] PROBLEM - Apache HTTP on mw1132 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 1.979 second response time [23:46:05] PROBLEM - HHVM rendering on mw1075 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 7.014 second response time [23:46:05] RECOVERY - Apache HTTP on mw1165 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.094 second response time [23:46:05] RECOVERY - Apache HTTP on mw1214 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.065 second response time [23:46:06] RECOVERY - HHVM rendering on mw1183 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.270 second response time [23:46:13] PROBLEM - HHVM rendering on mw1125 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.896 second response time [23:46:13] PROBLEM - HHVM rendering on mw1136 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.007 second response time [23:46:13] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.048 second response time [23:46:13] RECOVERY - HHVM rendering on mw1122 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.609 second response time [23:46:14] PROBLEM - HHVM rendering on mw1209 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.009 second response time [23:46:14] RECOVERY - Apache HTTP on mw1255 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.539 second response time [23:46:14] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.051 second response time [23:46:15] RECOVERY - 
Apache HTTP on mw1173 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.055 second response time [23:46:15] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.072 second response time [23:46:16] PROBLEM - Apache HTTP on mw1059 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 2.743 second response time [23:46:16] PROBLEM - HHVM rendering on mw1255 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [23:46:17] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: /page/title/{title} is CRITICAL: Test Get rev of by title from MW returned the unexpected status 503 (expecting: 200) [23:46:23] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.053 second response time [23:46:23] RECOVERY - HHVM rendering on mw1027 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.295 second response time [23:46:25] RECOVERY - Apache HTTP on mw1222 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.043 second response time [23:46:25] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.113 second response time [23:46:25] RECOVERY - HHVM rendering on mw1113 is OK: HTTP OK: HTTP/1.1 200 OK - 72495 bytes in 0.315 second response time [23:46:25] PROBLEM - Apache HTTP on mw1044 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.007 second response time [23:46:25] RECOVERY - HHVM rendering on mw1219 is OK: HTTP OK: HTTP/1.1 200 OK - 72488 bytes in 1.158 second response time [23:46:25] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [23:46:25] PROBLEM - Apache HTTP on mw1115 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.063 second response time [23:46:26] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-html/{title} is CRITICAL: Test retrieve en.wp main page via mobile-html returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-html-sections/{title} is CRITICAL: Test retrieve en.wp main page via mobile-html-sections returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-html-sections-remaining/{title} i [23:46:37] RECOVERY - Apache HTTP on mw1098 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.029 second response time [23:46:37] PROBLEM - HHVM rendering on mw1250 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time [23:46:38] RECOVERY - HHVM rendering on mw1221 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.259 second response time [23:46:38] RECOVERY - HHVM rendering on mw1222 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.190 second response time [23:46:39] RECOVERY - HHVM rendering on mw1184 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.434 second response time [23:46:39] PROBLEM - Apache HTTP on mw1079 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 2.150 second response time [23:46:43] PROBLEM - Apache HTTP on mw1144 is CRITICAL - Socket timeout after 10 seconds [23:46:43] PROBLEM - HHVM rendering on mw1163 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time [23:46:44] PROBLEM - Apache HTTP on mw1057 is CRITICAL - Socket timeout after 10 seconds [23:46:44] PROBLEM - Apache HTTP on 
mw1178 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.019 second response time [23:46:44] PROBLEM - Apache HTTP on mw1063 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.012 second response time [23:46:44] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 72495 bytes in 0.488 second response time [23:46:44] RECOVERY - HHVM rendering on mw1200 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.404 second response time [23:46:45] PROBLEM - HHVM rendering on mw1073 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.007 second response time [23:46:45] RECOVERY - HHVM rendering on mw1134 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.411 second response time [23:46:46] RECOVERY - Apache HTTP on mw1251 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.223 second response time [23:46:46] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /page/title/{title} is CRITICAL: Test Get rev of by title from MW returned the unexpected status 503 (expecting: 200) [23:46:54] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.106 second response time [23:46:55] PROBLEM - Apache HTTP on mw1224 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [23:46:55] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.120 second response time [23:46:55] RECOVERY - HHVM rendering on mw1041 is OK: HTTP OK: HTTP/1.1 200 OK - 72495 bytes in 0.839 second response time [23:46:55] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-html-sections/{title} is CRITICAL: Test retrieve en.wp main page via mobile-html-sections returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-text/{title} is CRITICAL: Test retrieve the lite en.wp main page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-html-sections-remaining/{title} is CRITI [23:46:55] PROBLEM - HHVM rendering on mw1162 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.018 second response time [23:46:56] RECOVERY - HHVM rendering on mw1092 is OK: HTTP OK: HTTP/1.1 200 OK - 72488 bytes in 1.160 second response time [23:46:56] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.095 second response time [23:46:57] RECOVERY - HHVM rendering on mw1049 is OK: HTTP OK: HTTP/1.1 200 OK - 72495 bytes in 0.788 second response time [23:46:57] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.123 second response time [23:46:59] !log ori@tin Synchronized php-1.26wmf18/includes/libs/ReplacementArray.php: (no message) (duration: 00m 27s) [23:47:03] PROBLEM - Apache HTTP on mw1068 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 6.863 second response time [23:47:03] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.090 second response time [23:47:03] PROBLEM - Apache HTTP on mw1185 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.008 second response time [23:47:04] PROBLEM - Apache HTTP on mw1249 is CRITICAL: Connection timed out [23:47:04] PROBLEM - HHVM 
rendering on mw1179 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.656 second response time [23:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:47:04] PROBLEM - Apache HTTP on mw1043 is CRITICAL - Socket timeout after 10 seconds [23:47:04] PROBLEM - Apache HTTP on mw1087 is CRITICAL - Socket timeout after 10 seconds [23:47:05] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: /page/title/{title} is CRITICAL: Test Get rev of by title from MW returned the unexpected status 503 (expecting: 200) [23:47:05] RECOVERY - HHVM rendering on mw1192 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.271 second response time [23:47:06] PROBLEM - HHVM rendering on mw1173 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.016 second response time [23:47:13] PROBLEM - HHVM rendering on mw1077 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.009 second response time [23:47:14] RECOVERY - HHVM rendering on mw1116 is OK: HTTP OK: HTTP/1.1 200 OK - 72495 bytes in 0.643 second response time [23:47:14] RECOVERY - Apache HTTP on mw1216 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time [23:47:14] RECOVERY - HHVM rendering on mw1087 is OK: HTTP OK: HTTP/1.1 200 OK - 72495 bytes in 0.255 second response time [23:47:14] RECOVERY - HHVM rendering on mw1081 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.289 second response time [23:47:14] RECOVERY - HHVM rendering on mw1044 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.333 second response time [23:47:14] PROBLEM - HHVM rendering on mw1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 1.005 second response time [23:47:15] PROBLEM - HHVM rendering on mw1024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 1.154 second response time [23:47:15] RECOVERY - HHVM rendering on mw1132 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.597 second response time [23:47:16] PROBLEM - Apache HTTP on mw1147 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 1.005 second response time [23:47:16] RECOVERY - HHVM rendering on mw1117 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.471 second response time [23:47:17] PROBLEM - HHVM rendering on mw1068 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 1.257 second response time [23:47:17] RECOVERY - HHVM rendering on mw1052 is OK: HTTP OK: HTTP/1.1 200 OK - 72488 bytes in 1.114 second response time [23:47:33] PROBLEM - Apache HTTP on mw1060 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 5.950 second response time [23:47:33] RECOVERY - Apache HTTP on mw1209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.067 second response time [23:47:34] PROBLEM - Apache HTTP on mw1039 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 9.888 second response time [23:47:34] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.105 second response time [23:47:34] PROBLEM - Apache HTTP on mw1170 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 1.607 second response time [23:47:34] RECOVERY - Apache HTTP on mw1174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [23:47:35] RECOVERY - Apache HTTP on mw1162 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 
bytes in 0.040 second response time [23:47:43] PROBLEM - Apache HTTP on mw1137 is CRITICAL - Socket timeout after 10 seconds [23:47:43] PROBLEM - Apache HTTP on mw1126 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.428 second response time [23:47:44] PROBLEM - HHVM rendering on mw1229 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time [23:47:44] RECOVERY - Apache HTTP on mw1105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.065 second response time [23:47:44] PROBLEM - Apache HTTP on mw1053 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.286 second response time [23:47:44] PROBLEM - Apache HTTP on mw1078 is CRITICAL - Socket timeout after 10 seconds [23:47:44] PROBLEM - HHVM rendering on mw1233 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time [23:47:45] RECOVERY - HHVM rendering on mw1131 is OK: HTTP OK: HTTP/1.1 200 OK - 72495 bytes in 0.419 second response time [23:47:45] PROBLEM - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50793 bytes in 0.110 second response time [23:47:49] RECOVERY - Apache HTTP on mw1183 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [23:47:50] What just happened, bblack? [23:47:53] PROBLEM - Apache HTTP on mw1056 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 7.848 second response time [23:47:54] PROBLEM - Apache HTTP on mw1212 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [23:47:58] Bsadowski1: i killed the site [23:48:00] Rouge patch? [23:48:00] it's recovering now [23:48:03] RECOVERY - HHVM rendering on mw1171 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.271 second response time [23:48:03] PROBLEM - HHVM rendering on mw1252 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.013 second response time [23:48:04] PROBLEM - HHVM rendering on mw1176 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.946 second response time [23:48:04] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 8935 bytes in 0.467 second response time [23:48:04] lol [23:48:06] i'll send a postmortem and post it to wikitech [23:48:08] PROBLEM - Apache HTTP on mw1024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.009 second response time [23:48:08] PROBLEM - Apache HTTP on mw1041 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.012 second response time [23:48:08] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.825 second response time [23:48:08] PROBLEM - Apache HTTP on mw1142 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 5.170 second response time [23:48:08] PROBLEM - Apache HTTP on mw1076 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.477 second response time [23:48:09] PROBLEM - Apache HTTP on mw1225 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.009 second response time [23:48:13] PROBLEM - HHVM rendering on mw1022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 4.400 second response time [23:48:14] RECOVERY - HHVM rendering on mw1145 
is OK: HTTP OK: HTTP/1.1 200 OK - 72495 bytes in 0.572 second response time [23:48:14] RECOVERY - HHVM rendering on mw1136 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.458 second response time [23:48:14] PROBLEM - Apache HTTP on mw1229 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.020 second response time [23:48:14] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.249 second response time [23:48:24] RECOVERY - Apache HTTP on mw1100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.134 second response time [23:48:24] PROBLEM - HHVM rendering on mw1043 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 5.313 second response time [23:48:24] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.085 second response time [23:48:24] RECOVERY - Apache HTTP on mw1246 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.954 second response time [23:48:25] PROBLEM - Apache HTTP on mw1025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.344 second response time [23:48:25] PROBLEM - Apache HTTP on mw1143 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 5.576 second response time [23:48:25] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.037 second response time [23:48:25] RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.028 second response time [23:48:25] PROBLEM - Apache HTTP on mw1239 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.012 second response time [23:48:33] PROBLEM - HHVM rendering on mw1170 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.651 second response time [23:48:33] PROBLEM - Apache HTTP on mw1054 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 5.918 second response time [23:48:33] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [23:48:33] RECOVERY - Apache HTTP on mw1044 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.043 second response time [23:48:33] RECOVERY - HHVM rendering on mw1150 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.370 second response time [23:48:34] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.041 second response time [23:48:34] PROBLEM - Apache HTTP on mw1067 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 4.867 second response time [23:48:35] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.076 second response time [23:48:35] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.044 second response time [23:48:43] PROBLEM - HHVM rendering on mw1034 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 9.291 second response time [23:48:43] RECOVERY - Apache HTTP on mw1033 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.058 second response time [23:48:44] PROBLEM - HHVM rendering on mw1243 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time [23:48:44] RECOVERY - Apache HTTP on mw1240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.054 second response time [23:48:44] RECOVERY - 
Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.134 second response time [23:48:44] PROBLEM - Apache HTTP on mw1245 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.138 second response time [23:48:44] RECOVERY - HHVM rendering on mw1257 is OK: HTTP OK: HTTP/1.1 200 OK - 72495 bytes in 0.394 second response time [23:48:45] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.322 second response time [23:48:45] PROBLEM - Apache HTTP on mw1119 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 7.273 second response time [23:48:46] PROBLEM - HHVM rendering on mw1181 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [23:48:46] PROBLEM - HHVM rendering on mw1187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.016 second response time [23:48:47] RECOVERY - Apache HTTP on mw1048 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.431 second response time [23:48:52] mediawiki.org is down? [23:48:53] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.244 second response time [23:48:53] RECOVERY - Apache HTTP on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.035 second response time [23:48:53] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50826 bytes in 0.142 second response time [23:48:57] PROBLEM - HHVM rendering on mw1102 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.013 second response time [23:48:57] see /topic! :) [23:48:58] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.056 second response time [23:49:03] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /page/title/{title} is CRITICAL: Test Get rev of by title from MW returned the unexpected status 503 (expecting: 200) [23:49:03] PROBLEM - HHVM rendering on mw1111 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 6.602 second response time [23:49:04] PROBLEM - Apache HTTP on mw1062 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 4.617 second response time [23:49:04] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.056 second response time [23:49:04] RECOVERY - Apache HTTP on mw1043 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.070 second response time [23:49:04] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.075 second response time [23:49:04] PROBLEM - Apache HTTP on mw1028 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 1.358 second response time [23:49:05] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.087 second response time [23:49:05] PROBLEM - HHVM rendering on mw1238 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.007 second response time [23:49:06] PROBLEM - HHVM rendering on mw1055 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.029 second response time [23:49:06] PROBLEM - HHVM rendering on mw1231 is CRITICAL: HTTP 
CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.007 second response time [23:49:07] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.062 second response time [23:49:07] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.254 second response time [23:49:08] RECOVERY - HHVM rendering on mw1108 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.515 second response time [23:49:11] lol [23:49:13] niedzielski: everything's down [23:49:13] bblack: thanks! :) [23:49:19] PROBLEM - HHVM rendering on mw1078 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.526 second response time [23:49:23] PROBLEM - HHVM rendering on mw1101 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 4.829 second response time [23:49:23] PROBLEM - HHVM rendering on mw1080 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.084 second response time [23:49:23] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.275 second response time [23:49:23] PROBLEM - HHVM rendering on mw1097 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 7.621 second response time [23:49:23] RECOVERY - HHVM rendering on mw1053 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.367 second response time [23:49:24] PROBLEM - HHVM rendering on mw1109 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.415 second response time [23:49:24] PROBLEM - HHVM rendering on mw1104 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.010 second response time [23:49:25] PROBLEM - HHVM rendering on mw1083 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.490 second response time [23:49:25] PROBLEM - HHVM rendering on mw1119 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.368 second response time [23:49:26] RECOVERY - Apache HTTP on mw1047 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.140 second response time [23:49:26] RECOVERY - HHVM rendering on mw1241 is OK: HTTP OK: HTTP/1.1 200 OK - 72495 bytes in 0.435 second response time [23:49:27] RECOVERY - HHVM rendering on mw1068 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.998 second response time [23:49:27] PROBLEM - Apache HTTP on mw1177 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time [23:49:27] The world is about to esplode [23:49:28] PROBLEM - HHVM rendering on mw1088 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 1.682 second response time [23:49:42] I'm randomly getting the old 503 error page and the new one [23:49:43] PROBLEM - Apache HTTP on mw1207 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.008 second response time [23:49:43] RECOVERY - Apache HTTP on mw1258 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.411 second response time [23:49:44] PROBLEM - Apache HTTP on mw1134 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 8.370 second response time [23:49:44] PROBLEM - Apache HTTP on mw1213 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.285 second response time [23:49:44] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.059 second response time [23:49:45] RECOVERY - Apache HTTP on 
mw1181 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.100 second response time [23:49:45] RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [23:49:45] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.060 second response time [23:49:45] RECOVERY - Apache HTTP on mw1175 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.032 second response time [23:49:46] RECOVERY - HHVM rendering on mw1040 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.387 second response time [23:49:46] PROBLEM - HHVM rendering on mw1220 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.014 second response time [23:49:47] PROBLEM - Apache HTTP on mw1164 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 5.334 second response time [23:49:47] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 72495 bytes in 0.229 second response time [23:49:50] * Tippopotamus needs someone to hold him [23:49:54] PROBLEM - Apache HTTP on mw1238 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.012 second response time [23:49:54] RECOVERY - Apache HTTP on mw1182 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.064 second response time [23:49:54] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.261 second response time [23:49:54] RECOVERY - HHVM rendering on mw1174 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.429 second response time [23:49:54] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.293 second response time [23:49:54] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [23:49:55] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.030 second response time [23:49:55] RECOVERY - HHVM rendering on mw1218 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.355 second response time [23:49:56] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.365 second response time [23:49:57] I thought we had switched over to the new one? [23:50:13] PROBLEM - HHVM rendering on mw1120 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 5.764 second response time [23:50:13] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.072 second response time [23:50:13] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 73009 bytes in 0.638 second response time [23:50:15] the new one is synthesized by varnish, the old one is straight from MW, I think? 
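As a rough way to see from the outside which layer is throwing the error, here is a minimal, heuristic-only Python sketch; it leans solely on the ~50350-byte body size the checks above report for the synthesized page, and the URL is just an example:

```python
# Heuristic sketch only: guess which layer produced a 503 by body size.
# Assumption: the Varnish-synthesized error page is the ~50350-byte body
# reported by the checks in this log; a backend-generated page differs.
import requests

def classify_503(url):
    resp = requests.get(url, timeout=10)
    if resp.status_code != 503:
        return "no 503 (got HTTP %d)" % resp.status_code
    if abs(len(resp.content) - 50350) < 200:
        return "looks like the new, edge-synthesized 503 page"
    return "looks like the old, backend-generated 503 page"

print(classify_503("https://en.wikipedia.org/wiki/Main_Page"))  # example URL
```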
[23:50:17] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.206 second response time [23:50:17] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: /page/title/{title} is CRITICAL: Test Get rev of by title from MW returned the unexpected status 503 (expecting: 200) [23:50:17] PROBLEM - Apache HTTP on mw1052 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 2.269 second response time [23:50:18] RECOVERY - Apache HTTP on mw1035 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.060 second response time [23:50:18] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.035 second response time [23:50:18] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.084 second response time [23:50:18] RECOVERY - HHVM rendering on mw1149 is OK: HTTP OK: HTTP/1.1 200 OK - 72495 bytes in 0.485 second response time [23:50:19] RECOVERY - HHVM rendering on mw1209 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.571 second response time [23:50:23] PROBLEM - HHVM rendering on mw1095 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 5.316 second response time [23:50:23] PROBLEM - Apache HTTP on mw1255 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.023 second response time [23:50:23] PROBLEM - Apache HTTP on mw1230 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.009 second response time [23:50:23] PROBLEM - Apache HTTP on mw1173 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time [23:50:23] PROBLEM - Restbase endpoints health on restbase1009 is CRITICAL: /page/title/{title} is CRITICAL: Test Get rev of by title from MW returned the unexpected status 503 (expecting: 200) [23:50:24] PROBLEM - Apache HTTP on mw1022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.010 second response time [23:50:24] depends which part throws the error [23:50:33] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [23:50:33] RECOVERY - Apache HTTP on mw1054 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.048 second response time [23:50:34] RECOVERY - Apache HTTP on mw1166 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.126 second response time [23:50:34] RECOVERY - HHVM rendering on mw1170 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.339 second response time [23:50:34] RECOVERY - HHVM rendering on mw1129 is OK: HTTP OK: HTTP/1.1 200 OK - 72495 bytes in 0.565 second response time [23:50:34] PROBLEM - Apache HTTP on mw1077 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [23:50:34] RECOVERY - Apache HTTP on mw1085 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.601 second response time [23:50:35] RECOVERY - HHVM rendering on mw1029 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.197 second response time [23:50:35] PROBLEM - Apache HTTP on mw1084 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 4.143 second response time [23:50:40] We need to get the engineers back from their vacations [23:50:43] PROBLEM - Apache HTTP on mw1091 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.621 second response time [23:50:43] PROBLEM - HHVM rendering on mw1146 is CRITICAL: HTTP CRITICAL: 
HTTP/1.1 503 Service Unavailable - 50350 bytes in 5.158 second response time [23:50:43] PROBLEM - Apache HTTP on mw1145 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 8.217 second response time [23:50:44] PROBLEM - HHVM rendering on mw1098 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time [23:50:44] PROBLEM - HHVM rendering on mw1064 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 5.403 second response time [23:50:44] PROBLEM - Apache HTTP on mw1121 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 4.488 second response time [23:50:44] PROBLEM - HHVM rendering on mw1138 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 5.723 second response time [23:50:45] PROBLEM - HHVM rendering on mw1248 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time [23:50:45] PROBLEM - HHVM rendering on mw1221 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [23:50:46] PROBLEM - HHVM rendering on mw1254 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.008 second response time [23:50:46] PROBLEM - Apache HTTP on mw1098 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [23:50:47] PROBLEM - Apache HTTP on mw1064 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [23:50:47] PROBLEM - HHVM rendering on mw1223 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [23:50:48] PROBLEM - HHVM rendering on mw1184 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.018 second response time [23:50:55] (03PS1) 10Ori.livneh: Revert "hhvm: Don't load fss.so" [puppet] - 10https://gerrit.wikimedia.org/r/231720 [23:51:03] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "hhvm: Don't load fss.so" [puppet] - 10https://gerrit.wikimedia.org/r/231720 (owner: 10Ori.livneh) [23:51:04] PROBLEM - Apache HTTP on mw1095 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.013 second response time [23:51:04] PROBLEM - HHVM rendering on mw1041 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.016 second response time [23:51:05] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50835 bytes in 0.766 second response time [23:51:09] PROBLEM - Apache HTTP on mw1131 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [23:51:09] PROBLEM - HHVM rendering on mw1092 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.030 second response time [23:51:09] PROBLEM - HHVM rendering on mw1148 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time [23:51:09] PROBLEM - HHVM rendering on mw1203 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.009 second response time [23:51:09] PROBLEM - Apache HTTP on mw1110 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.007 second response time [23:51:09] PROBLEM - Apache HTTP on mw1135 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.025 second response time [23:51:13] PROBLEM - HHVM 
rendering on mw1049 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.007 second response time [23:51:13] PROBLEM - Apache HTTP on mw1023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.019 second response time [23:51:23] PROBLEM - HHVM rendering on mw1217 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.003 second response time [23:51:24] PROBLEM - HHVM rendering on mw1192 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time [23:51:24] PROBLEM - HHVM rendering on mw1172 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.019 second response time [23:51:24] PROBLEM - Apache HTTP on mw1241 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [23:51:25] PROBLEM - HHVM rendering on mw1201 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [23:51:25] PROBLEM - HHVM rendering on mw1116 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time [23:51:25] (03PS1) 10Ori.livneh: Revert "Remove hhvm-fss package from production" [puppet] - 10https://gerrit.wikimedia.org/r/231721 [23:51:25] PROBLEM - Apache HTTP on mw1216 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.022 second response time [23:51:25] PROBLEM - HHVM rendering on mw1044 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.018 second response time [23:51:25] PROBLEM - Apache HTTP on mw1027 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.010 second response time [23:51:26] PROBLEM - HHVM rendering on mw1081 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.036 second response time [23:51:26] PROBLEM - HHVM rendering on mw1035 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time [23:51:27] PROBLEM - HHVM rendering on mw1052 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.014 second response time [23:51:33] PROBLEM - HHVM rendering on mw1117 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time [23:51:33] PROBLEM - Apache HTTP on mw1073 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [23:51:33] PROBLEM - Apache HTTP on mw1248 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time [23:51:33] PROBLEM - HHVM rendering on mw1143 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.008 second response time [23:51:34] PROBLEM - HHVM rendering on mw1128 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.012 second response time [23:51:34] PROBLEM - Apache HTTP on mw1231 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.010 second response time [23:51:35] PROBLEM - HHVM rendering on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [23:51:43] PROBLEM - Apache HTTP on mw1227 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.007 second response time [23:51:43] PROBLEM - Apache HTTP on mw1256 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.008 second response time 
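The reverts being pushed here restore fss.so and the hhvm-fss package; fss is HHVM's fast-string-search extension, which MediaWiki can use for bulk string replacement, so removing it out from under running app servers is consistent with the sudden fleet-wide 503s. A post-revert audit sketch, under stated assumptions (SSH access, an illustrative host list, and HHVM's PHP-compatible CLI mode):

```python
# Post-revert audit sketch: confirm the fss extension loads again on each
# app server. Host list, SSH access, and HHVM's --php CLI mode are
# assumptions; extension_loaded() is a standard PHP/HHVM builtin.
import subprocess

HOSTS = ["mw1041.eqiad.wmnet", "mw1090.eqiad.wmnet"]  # illustrative subset

def fss_loaded(host):
    cmd = ["ssh", host, "hhvm", "--php", "-r",
           "echo extension_loaded('fss') ? 'yes' : 'no';"]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    return result.stdout.strip() == "yes"

for host in HOSTS:
    print("%s: fss %s" % (host, "loaded" if fss_loaded(host) else "MISSING"))
```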
[23:51:43] PROBLEM - HHVM rendering on mw1142 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.027 second response time [23:51:43] PROBLEM - Apache HTTP on mw1161 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.013 second response time [23:51:43] PROBLEM - Apache HTTP on mw1252 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.023 second response time [23:51:44] PROBLEM - Apache HTTP on mw1234 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.009 second response time [23:51:44] PROBLEM - HHVM rendering on mw1147 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.022 second response time [23:51:44] (03PS2) 10Ori.livneh: Revert "Remove hhvm-fss package from production" [puppet] - 10https://gerrit.wikimedia.org/r/231721 [23:51:45] PROBLEM - Apache HTTP on mw1209 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.008 second response time [23:51:45] PROBLEM - Apache HTTP on mw1140 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [23:51:54] PROBLEM - Apache HTTP on mw1162 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time [23:51:54] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "Remove hhvm-fss package from production" [puppet] - 10https://gerrit.wikimedia.org/r/231721 (owner: 10Ori.livneh) [23:51:54] PROBLEM - Apache HTTP on mw1219 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.003 second response time [23:51:54] PROBLEM - HHVM rendering on mw1195 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time [23:51:54] PROBLEM - Restbase endpoints health on restbase1006 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /page/title/{title} is CRITICAL: Test Get rev of by title from MW returned the unexpected status 503 (expecting: 200) [23:51:54] PROBLEM - HHVM busy threads on mw1103 is CRITICAL 83.33% of data above the critical threshold [86.4] [23:51:55] PROBLEM - Apache HTTP on mw1070 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.008 second response time [23:52:02] ori ^ niedzielski i see it too here from the office [23:52:03] PROBLEM - Apache HTTP on mw1105 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time [23:52:04] PROBLEM - HHVM rendering on mw1131 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.009 second response time [23:52:04] PROBLEM - Apache HTTP on mw1101 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.010 second response time [23:52:04] PROBLEM - Apache HTTP on mw1104 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.027 second response time [23:52:14] PROBLEM - HHVM rendering on mw1171 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.008 second response time [23:52:14] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /page/title/{title} is CRITICAL: Test Get rev of by title from MW returned the unexpected status 503 (expecting: 200) [23:52:14] PROBLEM 
- Apache HTTP on mw1112 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.012 second response time [23:52:15] PROBLEM - Apache HTTP on mw1108 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.022 second response time [23:52:16] PROBLEM - Apache HTTP on mw1214 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.012 second response time [23:52:24] PROBLEM - HHVM rendering on mw1183 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.003 second response time [23:52:24] PROBLEM - Apache HTTP on mw1083 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [23:52:24] PROBLEM - HHVM rendering on mw1136 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.012 second response time [23:52:24] PROBLEM - Apache HTTP on mw1132 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 7.158 second response time [23:52:24] PROBLEM - HHVM rendering on mw1096 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.014 second response time [23:52:25] PROBLEM - HHVM rendering on mw1227 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.003 second response time [23:52:25] PROBLEM - Apache HTTP on mw1100 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.013 second response time [23:52:25] PROBLEM - Apache HTTP on mw1194 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time [23:52:26] PROBLEM - Restbase endpoints health on cerium is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /page/title/{title} is CRITICAL: Test Get rev of by title from MW returned the unexpected status 503 (expecting: 200) [23:52:34] PROBLEM - Restbase endpoints health on xenon is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /page/title/{title} is CRITICAL: Test Get rev of by title from MW returned the unexpected status 503 (expecting: 200) [23:52:34] PROBLEM - Apache HTTP on mw1222 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.009 second response time [23:52:35] PROBLEM - HHVM rendering on mw1113 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [23:52:35] PROBLEM - Apache HTTP on mw1197 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [23:52:35] PROBLEM - Apache HTTP on mw1044 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time [23:52:35] PROBLEM - Apache HTTP on mw1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.021 second response time [23:52:35] PROBLEM - Apache HTTP on mw1115 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.019 second response time [23:52:36] PROBLEM - HHVM rendering on mw1150 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [23:52:44] PROBLEM - HHVM rendering on mw2151 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.143 second response time [23:52:44] PROBLEM - Apache HTTP on mw2168 is CRITICAL: HTTP CRITICAL: HTTP/1.1 
503 Service Unavailable - 50350 bytes in 0.139 second response time [23:52:44] PROBLEM - Apache HTTP on mw1127 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.021 second response time [23:52:44] PROBLEM - HHVM rendering on mw2150 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.141 second response time [23:52:44] PROBLEM - HHVM rendering on mw2178 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.148 second response time [23:53:18] Reverse the polarity! [23:53:31] don't cross the streams? [23:53:58] maybe we should cross the streams? [23:54:14] dr0ptp4kt: yeah, it looks like even wikipedia.org is affected :( [23:54:16] You want to blow up the entire encyclopedia? [23:54:29] lool [23:54:43] Oh nooo [23:54:44] https://de.wikipedia.org/wiki/Wikipedia:Hauptseite [23:54:45] Down [23:54:51] Is it because I merged the main page with AIV? [23:54:55] no [23:55:01] RECOVERY - HHVM rendering on mw2157 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.750 second response time [23:55:01] RECOVERY - Apache HTTP on mw2076 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.114 second response time [23:55:02] RECOVERY - HHVM rendering on mw2209 is OK: HTTP OK: HTTP/1.1 200 OK - 73848 bytes in 0.997 second response time [23:55:02] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.044 second response time [23:55:03] RECOVERY - HHVM rendering on mw1082 is OK: HTTP OK: HTTP/1.1 200 OK - 72495 bytes in 0.403 second response time [23:55:03] RECOVERY - Apache HTTP on mw2020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.119 second response time [23:55:04] RECOVERY - HHVM rendering on mw1045 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.280 second response time [23:55:04] RECOVERY - HHVM rendering on mw2036 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.446 second response time [23:55:05] Chillum, take a time out [23:55:05] RECOVERY - HHVM rendering on mw2141 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.459 second response time [23:55:13] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50868 bytes in 0.757 second response time [23:55:16] use dynamite to fix the wiki [23:55:17] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.571 second response time [23:55:17] RECOVERY - HHVM rendering on mw1203 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.718 second response time [23:55:18] RECOVERY - HHVM rendering on mw1059 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.242 second response time [23:55:18] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.837 second response time [23:55:18] RECOVERY - HHVM rendering on mw1060 is OK: HTTP OK: HTTP/1.1 200 OK - 72488 bytes in 2.848 second response time [23:55:18] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.238 second response time [23:55:20] lots of dynamite [23:55:23] RECOVERY - HHVM rendering on mw1061 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.260 second response time [23:55:33] PROBLEM - graphoid endpoints health on sca1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [23:55:48] niedzielski: ori is reviewing [23:55:53] RECOVERY - HHVM rendering on mw2129 is OK:
HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.792 second response time [23:56:04] RECOVERY - HHVM rendering on mw2207 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.506 second response time [23:56:04] RECOVERY - HHVM rendering on mw2050 is OK: HTTP OK: HTTP/1.1 200 OK - 73849 bytes in 1.805 second response time [23:56:14] PROBLEM - graphoid endpoints health on sca1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [23:56:15] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 50727 bytes in 0.082 second response time [23:56:23] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.769 second response time [23:56:24] RECOVERY - HHVM rendering on mw2045 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.615 second response time [23:56:33] RECOVERY - HHVM rendering on mw2158 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.492 second response time [23:56:33] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /page/title/{title} is CRITICAL: Test Get rev of by title from MW returned the unexpected status 503 (expecting: 200) [23:56:34] RECOVERY - Apache HTTP on mw2018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.135 second response time [23:56:34] PROBLEM - Apache HTTP on mw1233 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.004 second response time [23:56:34] RECOVERY - HHVM rendering on mw1112 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 3.381 second response time [23:56:34] RECOVERY - Apache HTTP on mw2045 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.583 second response time [23:56:43] RECOVERY - Apache HTTP on mw2207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.100 second response time [23:56:44] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.064 second response time [23:56:54] RECOVERY - Apache HTTP on mw2146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.123 second response time [23:56:54] RECOVERY - Apache HTTP on mw2120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.481 second response time [23:56:58] is this good or bad? 
[23:57:02] good [23:57:03] RECOVERY - Apache HTTP on mw2136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.129 second response time [23:57:03] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 72488 bytes in 7.414 second response time [23:57:10] RECOVERY - HHVM rendering on mw2146 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.489 second response time [23:57:10] RECOVERY - HHVM rendering on mw1073 is OK: HTTP OK: HTTP/1.1 200 OK - 72488 bytes in 5.142 second response time [23:57:10] RECOVERY - HHVM rendering on mw2024 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.450 second response time [23:57:11] RECOVERY - HHVM rendering on mw2120 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.535 second response time [23:57:11] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.083 second response time [23:57:13] it's about hhvm packages and the app servers are coming back [23:57:13] It's critical, I can tell you that [23:57:13] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 72902 bytes in 0.549 second response time [23:57:16] but does it come with frogurt? [23:57:17] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 73010 bytes in 0.598 second response time [23:57:21] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.019 second response time [23:57:21] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.053 second response time [23:57:21] RECOVERY - HHVM rendering on mw1158 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.410 second response time [23:57:22] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.344 second response time [23:57:22] RECOVERY - HHVM rendering on mw2095 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.412 second response time [23:57:33] PROBLEM - Apache HTTP on mw1185 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.021 second response time [23:57:33] PROBLEM - HHVM rendering on mw2186 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.142 second response time [23:57:34] RECOVERY - Apache HTTP on mw2158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.183 second response time [23:57:34] RECOVERY - Apache HTTP on mw2129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.578 second response time [23:57:34] RECOVERY - HHVM rendering on mw1172 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.269 second response time [23:57:34] RECOVERY - HHVM rendering on mw2073 is OK: HTTP OK: HTTP/1.1 200 OK - 72492 bytes in 4.534 second response time [23:57:35] PROBLEM - HHVM rendering on mw1141 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.008 second response time [23:57:43] folks, Comms is asking if we can send out a @wikimedia/wikipedia tweet https://wikitech.wikimedia.org/wiki/Incident_response#Communicating_with_the_public ... 
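The "Restbase endpoints health" and "graphoid endpoints health" alerts in this log come from probes that hit each REST route and compare the status code against an expected value. A minimal sketch of that pattern, with the route template taken from the alert text; the base URL, port, and sample title are hypothetical:

```python
# Sketch of an endpoint health probe in the spirit of the "Restbase
# endpoints health" checks in this log. Base URL, port, and title are
# assumptions; the route template comes from the alert text.
import requests

BASE = "http://restbase1005.eqiad.wmnet:7231/en.wikipedia.org/v1"  # hypothetical
CHECKS = [("/page/title/{title}", {"title": "Main_Page"}, 200)]

for route, params, expected in CHECKS:
    status = requests.get(BASE + route.format(**params), timeout=10).status_code
    state = "OK" if status == expected else "CRITICAL"
    print("%s: %s returned %d (expecting: %d)" % (state, route, status, expected))
```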
[23:57:43] RECOVERY - graphoid endpoints health on sca1002 is OK: All endpoints are healthy [23:57:43] PROBLEM - HHVM rendering on mw1132 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.012 second response time [23:57:43] RECOVERY - Apache HTTP on mw1058 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.077 second response time [23:57:43] RECOVERY - Apache HTTP on mw2126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.864 second response time [23:57:44] RECOVERY - Apache HTTP on mw1177 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.089 second response time [23:57:44] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 72488 bytes in 1.937 second response time [23:57:44] PROBLEM - Apache HTTP on mw2027 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.143 second response time [23:57:45] PROBLEM - Apache HTTP on mw1030 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.003 second response time [23:57:46] PROBLEM - Apache HTTP on mw2202 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.145 second response time [23:57:53] RECOVERY - Apache HTTP on mw1073 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.968 second response time [23:57:53] RECOVERY - HHVM rendering on mw2126 is OK: HTTP OK: HTTP/1.1 200 OK - 72492 bytes in 2.760 second response time [23:57:53] PROBLEM - Apache HTTP on mw1141 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.022 second response time [23:57:53] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.086 second response time [23:57:54] RECOVERY - Apache HTTP on mw2050 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.333 second response time [23:57:54] RECOVERY - HHVM rendering on mw1177 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.159 second response time [23:57:54] PROBLEM - Apache HTTP on mw2091 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.147 second response time [23:57:55] PROBLEM - Apache HTTP on mw1134 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.013 second response time [23:57:55] PROBLEM - Apache HTTP on mw2008 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.147 second response time [23:57:56] PROBLEM - HHVM rendering on mw1072 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.017 second response time [23:57:56] RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 72488 bytes in 2.171 second response time [23:57:57] PROBLEM - Apache HTTP on mw2193 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.140 second response time [23:58:01] any specific information about ETA etc. we can put in, or should it be something generic?
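For reference, every PROBLEM/RECOVERY line in this log follows the same check output shape: state, HTTP status line, byte count, response time. A small sketch reproducing that format, with an illustrative target host and a simple status-at-or-above-400-is-critical rule:

```python
# check_http-style probe mirroring the alert format seen throughout this
# log (state, status line, byte count, response time). Target host and
# the >= 400 threshold are illustrative choices, not the real checker.
import time
import requests

def check_http(url):
    start = time.monotonic()
    resp = requests.get(url, timeout=10, allow_redirects=False)
    elapsed = time.monotonic() - start
    state = "OK" if resp.status_code < 400 else "CRITICAL"
    return "HTTP %s: HTTP/1.1 %d %s - %d bytes in %.3f second response time" % (
        state, resp.status_code, resp.reason, len(resp.content), elapsed)

print(check_http("http://mw1090.eqiad.wmnet/"))  # illustrative app server
```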
[23:58:03] PROBLEM - Apache HTTP on mw2032 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.140 second response time [23:58:04] RECOVERY - Apache HTTP on mw1170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [23:58:04] PROBLEM - HHVM rendering on mw2032 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.141 second response time [23:58:04] RECOVERY - HHVM rendering on mw2198 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.777 second response time [23:58:04] PROBLEM - HHVM rendering on mw2026 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.139 second response time [23:58:04] PROBLEM - HHVM rendering on mw2106 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.142 second response time [23:58:04] PROBLEM - HHVM rendering on mw2167 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.139 second response time [23:58:05] PROBLEM - Apache HTTP on mw1219 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.010 second response time [23:58:05] RECOVERY - HHVM rendering on mw1195 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.280 second response time [23:58:06] RECOVERY - Apache HTTP on mw2166 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.118 second response time [23:58:23] RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.180 second response time [23:58:23] RECOVERY - HHVM rendering on mw1157 is OK: HTTP OK: HTTP/1.1 200 OK - 72488 bytes in 6.928 second response time [23:58:23] RECOVERY - HHVM rendering on mw2077 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 0.501 second response time [23:58:23] PROBLEM - HHVM rendering on mw1185 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.015 second response time [23:58:23] never ask IT for an ETA on when it'll be up [23:58:24] PROBLEM - HHVM rendering on mw2069 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.144 second response time [23:58:24] PROBLEM - HHVM rendering on mw2057 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.145 second response time [23:58:24] PROBLEM - HHVM rendering on mw2076 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.142 second response time [23:58:27] HaeB: it's supposed to be coming back up [23:58:33] RECOVERY - HHVM rendering on mw2136 is OK: HTTP OK: HTTP/1.1 200 OK - 72492 bytes in 3.844 second response time [23:58:33] PROBLEM - Apache HTTP on mw1165 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time [23:58:33] RECOVERY - HHVM rendering on mw1058 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.243 second response time [23:58:34] PROBLEM - HHVM rendering on mw1226 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [23:58:34] PROBLEM - HHVM rendering on mw1022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.010 second response time [23:58:34] RECOVERY - HHVM rendering on mw1039 is OK: HTTP OK: HTTP/1.1 200 OK - 72495 bytes in 0.219 second response time [23:58:34] RECOVERY - Apache HTTP on mw2077 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.316 second response time [23:58:35] PROBLEM - HHVM rendering on mw1145 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service 
Unavailable - 50350 bytes in 0.015 second response time [23:58:35] PROBLEM - HHVM rendering on mw1031 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.008 second response time [23:58:36] PROBLEM - HHVM rendering on mw2068 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.144 second response time [23:58:36] RECOVERY - Apache HTTP on mw2052 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.275 second response time [23:58:36] "All our sites are currently experiencing problems. Our engineers are working to fix the issue." [23:58:37] RECOVERY - Apache HTTP on mw2095 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.917 second response time [23:58:37] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.471 second response time [23:58:38] HaeB: the site just loaded for me, actually. [23:58:38] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.117 second response time [23:58:43] RECOVERY - HHVM rendering on mw1149 is OK: HTTP OK: HTTP/1.1 200 OK - 72488 bytes in 4.201 second response time [23:58:43] RECOVERY - HHVM rendering on mw1256 is OK: HTTP OK: HTTP/1.1 200 OK - 72487 bytes in 0.391 second response time [23:58:43] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.044 second response time [23:58:43] RECOVERY - HHVM rendering on mw2124 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 1.100 second response time [23:58:43] RECOVERY - HHVM rendering on mw2176 is OK: HTTP OK: HTTP/1.1 200 OK - 72491 bytes in 1.097 second response time [23:58:44] RECOVERY - HHVM rendering on mw2204 is OK: HTTP OK: HTTP/1.1 200 OK - 72492 bytes in 1.772 second response time [23:58:52] MatmaRex: i'm still getting 503s [23:58:57] yes, it works again for me [23:59:01] yes [23:59:04] up again [23:59:27] thanks, everyone [23:59:37] \o/ [23:59:46] *reload* ok, now for me too [23:59:49] wheee, it's alive!