[00:00:56] I'm going to go ahead with my patch for now [00:02:11] it's in an extension anyway so it's not like we can't have jenkins do both at once [00:02:46] and my patch is kind of important, though it's important enough that I could do it after the window anyway [00:03:23] oh, it is after the window. 00:03 [00:03:29] Krenair: nope [00:03:38] i've removed that in a patch which MaxSem has already +2ed [00:03:52] the references in the MF extension are removed on master? [00:04:03] the references are inert [00:04:07] yeah [00:04:09] okay [00:04:12] we just add one to extension.json and document it [00:04:16] but never do anything with it :) [00:04:17] (03CR) 10Alex Monk: [C: 032] Cleanup deprecated MobileFrontend variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312339 (owner: 10Jdlrobson) [00:04:23] that's what I thought [00:04:24] * greg-g heads homeward [00:04:34] (removed in https://gerrit.wikimedia.org/r/#/c/312343/) [00:04:40] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:05:53] (03PS3) 10Alex Monk: Cleanup deprecated MobileFrontend variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312339 (owner: 10Jdlrobson) [00:06:40] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:07:01] (my patch is on mw1099) [00:07:18] (my patch works) [00:08:29] !log krenair@tin Synchronized php-1.28.0-wmf.20/extensions/FlaggedRevs/business/RevisionReviewForm.php: https://gerrit.wikimedia.org/r/#/c/312423/ (duration: 00m 48s) [00:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:09:06] (verified) [00:11:08] ugh, right, gj jenkins [00:11:18] it had V+2'd jdlrobson's patch but not submitted it [00:11:41] (03PS1) 10Thcipriani: Scap: Bump installed version to 3.3.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/312437 [00:12:19] (03CR) 10Thcipriani: [C: 04-1] "Needs new 
packages on carbon" [puppet] - 10https://gerrit.wikimedia.org/r/312437 (owner: 10Thcipriani) [00:13:02] jdlrobson, your patch is on mw1099 [00:13:12] Krenair: on it [00:15:03] Krenair: looks good [00:15:32] sync away [00:15:52] (03CR) 10Jdlrobson: Blacklist minerva from showing Related Articles in the footer (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311197 (https://phabricator.wikimedia.org/T144912) (owner: 10Bmansurov) [00:15:54] syncing mobile.php first [00:16:21] !log krenair@tin Synchronized wmf-config/mobile.php: https://gerrit.wikimedia.org/r/312339 (duration: 00m 47s) [00:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:16:52] now IS [00:17:22] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/312339 (duration: 00m 48s) [00:17:23] jdlrobson, ^ [00:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:17:59] thanks Krenair ! [00:18:59] SWAT is done [00:19:19] PROBLEM - Apache HTTP on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:19:47] there'll be no SWAT next week due to the ops offsite, so the next is 2016-10-03 13:00-14:00 UTC [00:19:51] jouncebot, next [00:19:51] In 252 hour(s) and 40 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161003T1300) [00:20:21] PROBLEM - HHVM rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:21:50] ^ don't know what's up with mw1224 [00:23:06] apache seems stuck there [00:24:21] couple of hhvm processes stuck at 100% cpu [00:25:39] tempted to just restart apache but I don't want to prevent investigation of what's wrong [00:33:43] PROBLEM - puppet last run on db1089 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:35:38] PROBLEM - puppet last run on elastic1028 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:38:22] !log mw1224 apache stuck, not restarting for now in case someone wants to investigate later. possibly T89912? [00:38:23] T89912: HHVM lock-ups - https://phabricator.wikimedia.org/T89912 [00:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:58:19] RECOVERY - puppet last run on db1089 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:00:10] RECOVERY - puppet last run on elastic1028 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [01:38:31] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2661163 (10GWicke) [01:44:13] (03PS3) 10BBlack: upload storage: transition cp3046+cp3047 [puppet] - 10https://gerrit.wikimedia.org/r/312306 (https://phabricator.wikimedia.org/T145661) [01:44:20] (03CR) 10BBlack: [C: 032 V: 032] upload storage: transition cp3046+cp3047 [puppet] - 10https://gerrit.wikimedia.org/r/312306 (https://phabricator.wikimedia.org/T145661) (owner: 10BBlack) [01:49:22] !log depooled mw1224 service apache2 [01:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:55:20] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 9 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [01:57:52] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:10:10] !log mw1206, mw1224 - restarted hhvm and apache [02:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:10:30] RECOVERY - HHVM rendering on mw1206 is OK: HTTP OK: HTTP/1.1 200 OK - 75236 bytes in 0.174 second response time [02:10:59] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 75228 bytes in 0.167 second response time [02:11:00] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 27 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [02:11:00] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.045 second response time [02:12:19] RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.594 second response time [02:12:41] 06Operations, 07HHVM: HHVM lock-ups - https://phabricator.wikimedia.org/T89912#2661244 (10Dzahn) 19:10 < mutante> !log mw1206, mw1224 - restarted hhvm and apache 19:10 < icinga-wm> RECOVERY - HHVM rendering on mw1206 is OK: HTTP OK: HTTP/1.1 200 OK - 75236 bytes in 0.174 second response time 19:11 < icinga-w... 
[02:13:28] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [02:13:33] !log maxsem@tin Synchronized php-1.28.0-wmf.20/extensions/SecurePoll/: https://gerrit.wikimedia.org/r/#/c/312450/1 (duration: 00m 51s) [02:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:28:01] PROBLEM - MegaRAID on db1060 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [02:39:54] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.20) (duration: 17m 04s) [02:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:41:51] (03PS1) 10Alex Monk: tcpircbot: update comment detailing each IP in the ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/312454 [02:43:30] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:45:22] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0] [02:46:04] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Sep 23 02:46:04 UTC 2016 (duration 6m 10s) [02:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:50:15] (03PS1) 10Alex Monk: tcpircbot: Follow-up Ide89c59f: Update ferm rules too [puppet] - 10https://gerrit.wikimedia.org/r/312455 [02:50:17] (03PS1) 10Alex Monk: tcpircbot: remove localhost from ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/312456 [02:50:19] (03PS1) 10Alex Monk: tcpircbot: Follow-up Ide89c59f: Fix missing CIDR prefix on puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/312457 [02:52:30] (03CR) 10Alex Monk: "it turns out that this is not really necessary, but does make it consistent" [puppet] - 10https://gerrit.wikimedia.org/r/312457 (owner: 10Alex Monk) [02:52:56] (03CR) 10Alex Monk: ">>> import netaddr" [puppet] - 
10https://gerrit.wikimedia.org/r/312457 (owner: 10Alex Monk) [02:55:03] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [03:07:59] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [03:50:11] (03PS3) 10BBlack: upload storage: finish esams (cp3048+cp3049) [puppet] - 10https://gerrit.wikimedia.org/r/312307 (https://phabricator.wikimedia.org/T145661) [03:50:17] (03CR) 10BBlack: [C: 032 V: 032] upload storage: finish esams (cp3048+cp3049) [puppet] - 10https://gerrit.wikimedia.org/r/312307 (https://phabricator.wikimedia.org/T145661) (owner: 10BBlack) [04:02:16] !log aaron@tin Synchronized php-1.28.0-wmf.20/includes/libs/rdbms: 5af1b93db1bb3d14844c55e4e3ed17fe963de551 (duration: 00m 51s) [04:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:03:17] !log aaron@tin Synchronized php-1.28.0-wmf.20/includes/deferred: 5af1b93db1bb3d14844c55e4e3ed17fe963de551 (duration: 00m 48s) [04:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:58:52] 06Operations, 10ops-esams, 10DNS, 10Traffic, 10netops: eeden ethernet outage - https://phabricator.wikimedia.org/T146391#2659577 (10grin) (testing lurking on phabricator made me see this ;-)) my 2'cents: since defgw was not pingable I'd check (apart from arp) irqs on the machine, I suspect you've checked... [05:59:58] PROBLEM - puppet last run on graphite1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:01:33] 06Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2661382 (10Naveenpf) @Aklapper I know this phabricator ticket was opened for simple change from url forward to giving proper ip address to the web... 
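The review comment on change 312457 quoted above starts what looks like a `netaddr` session, next to the remark that the missing CIDR prefix "is not really necessary, but does make it consistent". As a rough illustration of that point, the sketch below uses Python's standard-library `ipaddress` module instead of `netaddr` (a deliberate swap, so that it runs without extra packages). The address is a documentation-range placeholder, not the real puppetmaster2001 address, and `same_network` is a hypothetical helper. It shows only how Python parses these strings; it makes no claim about how ferm itself treats a bare address.

```python
import ipaddress


def same_network(a: str, b: str) -> bool:
    """Return True if both strings parse to the same IP network."""
    return ipaddress.ip_network(a) == ipaddress.ip_network(b)


# Placeholder address from the documentation range (assumption, not from the log).
bare = "192.0.2.10"
explicit = "192.0.2.10/32"

# A bare IPv4 host address is parsed as a single-host /32 network.
print(ipaddress.ip_network(bare))    # 192.0.2.10/32
print(same_network(bare, explicit))  # True
```

Under this parsing the explicit `/32` changes nothing semantically, which fits the description of the patch as a consistency fix.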
[06:09:19] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [06:11:42] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [06:21:29] (03PS7) 10Giuseppe Lavagetto: scap: introduce scap_source type [puppet] - 10https://gerrit.wikimedia.org/r/308973 [06:22:06] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "Tested in beta, will apply with care" [puppet] - 10https://gerrit.wikimedia.org/r/308973 (owner: 10Giuseppe Lavagetto) [06:24:40] RECOVERY - puppet last run on graphite1003 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:29:40] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:31:30] <_joe_> mira is me [06:32:10] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:32:27] (03CR) 10Jcrespo: "Needs manual rebase." [puppet] - 10https://gerrit.wikimedia.org/r/305668 (https://phabricator.wikimedia.org/T138778) (owner: 10Dduvall) [06:36:01] PROBLEM - puppet last run on mw1234 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:39:52] 06Operations, 10ops-eqiad, 10DBA: db1060: Degraded RAID - https://phabricator.wikimedia.org/T146449#2661404 (10Marostegui) [06:41:05] ACKNOWLEDGEMENT - MegaRAID on db1060 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Marostegui https://phabricator.wikimedia.org/T146449 [06:42:50] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4212377 keys - replication_delay is 0 [06:48:50] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[debian-goodies],Package[atop] [07:01:03] (03PS1) 10Jcrespo: mariadb: merge beta's 305668, 310360 after refactoring; fix dbstore [puppet] - 10https://gerrit.wikimedia.org/r/312471 [07:01:23] RECOVERY - puppet last run on mw1234 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:02:49] !log rebooting francium for kernel security update [07:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:05:26] (03CR) 10Jcrespo: [C: 032] mariadb: merge beta's 305668, 310360 after refactoring; fix dbstore [puppet] - 10https://gerrit.wikimedia.org/r/312471 (owner: 10Jcrespo) [07:12:00] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:15:12] PROBLEM - puppet last run on mw1232 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:33:59] !log executed 'find /var/log/hhvm/ -type f -user root -exec chown www-data:www-data {} \;' for all the api and appservers to remove/prevent cronspam (root:adm files also related to new reimaged hosts, Rsyslog needs to be configured before hhvm) - T132324 [07:34:01] T132324: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324 [07:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:40:30] RECOVERY - puppet last run on mw1232 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:09:32] (03CR) 10Jcrespo: [C: 04-1] "See 312471." 
[puppet] - 10https://gerrit.wikimedia.org/r/310360 (https://phabricator.wikimedia.org/T138778) (owner: 10Dduvall) [08:09:49] (03CR) 10Jcrespo: [C: 04-1] "See 312471" [puppet] - 10https://gerrit.wikimedia.org/r/305668 (https://phabricator.wikimedia.org/T138778) (owner: 10Dduvall) [08:13:46] 06Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2661528 (10Aklapper) The `Tags` above mentioned #Operations and #WMF-Legal. [08:30:18] !log upgrading varnishkafka to 1.0.12-1 in cache:maps [08:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:38:54] (03PS1) 10Filippo Giunchedi: scap: update to 3.3.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/312480 (https://phabricator.wikimedia.org/T127762) [08:44:38] !log depooled nginx restart on cp4003 and cp1045 for libssl upgrade [08:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:44:44] 06Operations, 06Labs, 10Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661551 (10doctaxon) [08:45:27] 06Operations, 06Labs, 10Tool-Labs: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661564 (10doctaxon) p:05Triage>03High [08:46:12] PROBLEM - puppet last run on dbproxy1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:52:11] !log upgrading varnishkafka to 1.0.12-1 in cache:misc [08:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:55:09] 06Operations, 10DBA: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T129452#2661591 (10jcrespo) [08:55:42] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T134755#2661595 (10jcrespo) [08:55:45] 06Operations, 10DBA: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T129452#2106136 (10jcrespo) [08:57:31] 06Operations, 10Analytics, 10Traffic: Sort out analytics service dependency issues for cp* cache hosts - https://phabricator.wikimedia.org/T128374#2661614 (10elukey) T138747 upgraded Varnishkafka to a new version able to start at any time and poll periodically the Varnish shm logs to see if they are open or... [09:06:34] !log reboot eventlog2001.codfw.wmnet for kernel upgrades [09:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:08:46] !log reimaging mira to jessie [09:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:09:44] RECOVERY - puppet last run on dbproxy1004 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [09:11:20] <_joe_> moritzm: uhm that's going to be "interesting" [09:12:43] 06Operations, 10ops-codfw: Decommission labsdb1002 - https://phabricator.wikimedia.org/T146455#2661648 (10jcrespo) [09:13:19] 06Operations, 10ops-codfw: Decommission labsdb1002 - https://phabricator.wikimedia.org/T146455#2661664 (10jcrespo) [09:13:21] 06Operations, 10hardware-requests: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#2661663 (10jcrespo) [09:14:51] (03PS1) 10Jcrespo: mariadb: Add custom mysqld_safe to es200[1234] [puppet] - 10https://gerrit.wikimedia.org/r/312485 
(https://phabricator.wikimedia.org/T145378) [09:15:09] Is the jobrunner broken? [09:15:14] https://grafana.wikimedia.org/dashboard/db/ores-extension [09:15:25] No jobs in the last ten hours [09:15:40] _joe_: why? this has been tested by releng extensively [09:17:41] <_joe_> moritzm: you're reimaging in codfw [09:17:51] <_joe_> it will be a test of our current puppet infra [09:18:18] ah, ok :-) I'll let you know if anything is strange, so far all looks fine [09:18:32] (PS2) Jcrespo: mariadb: Add custom mysqld_safe to es200[1234] [puppet] - https://gerrit.wikimedia.org/r/312485 (https://phabricator.wikimedia.org/T145378) [09:18:40] <_joe_> if the puppet cert got signed, please tell me [09:18:53] <_joe_> so that I can check it is signed by the correct server [09:19:08] ok, will do [09:19:39] (CR) Muehlenhoff: [C: 1] "Looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/312485 (https://phabricator.wikimedia.org/T145378) (owner: Jcrespo) [09:19:57] (PS3) Jcrespo: mariadb: Add custom mysqld_safe to es200[1234] [puppet] - https://gerrit.wikimedia.org/r/312485 (https://phabricator.wikimedia.org/T145378) [09:20:04] <_joe_> Amir1: https://grafana.wikimedia.org/dashboard/db/job-queue-health shows an increased error count since yesterday [09:20:32] <_joe_> I guess that's not what is happening there [09:20:45] They are not even getting started [09:20:45] <_joe_> it looks like it stopped reporting done jobs at the time [09:21:00] started at the deployment of wmf.20 [09:21:02] <_joe_> that graph just says "no data since" [09:21:16] <_joe_> Amir1: I'll take a look at the logs in a few [09:21:21] thanks [09:23:00] (CR) Jcrespo: [C: 2] mariadb: Add custom mysqld_safe to es200[1234] [puppet] - https://gerrit.wikimedia.org/r/312485 (https://phabricator.wikimedia.org/T145378) (owner: Jcrespo) [09:25:24] <_joe_> Amir1: what is the name of the ores jobs?
[09:25:47] _joe_: ORESFetchScoreJob [09:26:40] <_joe_> Amir1: INFO: ORESFetchScoreJob Ладва,_Оттомар revid=80946250 extra_params={"precache":".... ,timestamp=1474537300,QueuePartition=rdb4-6380) t=1021 good [09:26:54] _joe_: mira puppet cert now signed [09:26:55] <_joe_> so it seems they're being processed [09:26:59] <_joe_> moritzm: thanks [09:27:35] (PS1) Jcrespo: db1010: retire entry from dhcp install [puppet] - https://gerrit.wikimedia.org/r/312486 (https://phabricator.wikimedia.org/T129395) [09:28:15] _joe_: how many? [09:28:16] https://en.wikipedia.org/w/index.php?title=Special:RecentChanges&limit=500 [09:28:24] There is no highlighted row [09:28:32] there are usually 2 or 3 in every 50 [09:28:38] same for other wikis [09:28:57] <_joe_> Amir1: a lot from what I see [09:29:13] Last highlighted row from enwiki: (diff | hist) . . r Chewing gum; 20:08 . . (+15) . . 2601:588:4201:3b50:38ea:5e32:16e2:c37e (talk) (creator) (Tag: Visual edit) [09:29:26] <_joe_> I'm not saying it's working correctly, I'm just saying the jobqueue thinks it processes them correctly [09:29:50] yeah, I'm thinking there is something else broken [09:31:27] <_joe_> actually [09:31:30] <_joe_> let me see [09:32:11] <_joe_> the file wasn't rotated, so I was seeing just jobs from yesterday [09:32:51] https://logstash.wikimedia.org/app/kibana#/dashboard/default?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-12h,mode:quick,to:now))&_a=(filters:!(),options:(darkTheme:!f),panels:!((col:1,id:Dashboards,panelIndex:1,row:1,size_x:12,size_y:1,type:visualization),(col:1,id:Events-Over-Time,panelIndex:2,row:2,size_x:9,size_y:2,type:visualization),(col:1,id:Event-Types,panelIndex:3,row:4,size_x:9,size_y:2,type:visualization),(col:10,id:Event-Level,panelIndex:4,row:4,size_x:3,size_y:2,type:visualization),(col:1,columns:!(type,level,wiki,host,message),id:Default-Events-List,panelIndex:5,row:6,size_x:12,size_y:25,sort:!('@timestamp',desc),type:search),(col:10,id:Top-20-Hosts,panelIndex:6,row:2,size_x:3,size_y:2,type:visualization)),query:(query_string:(analyze_wildcard:!t,query:ORESFetchScoreJob)),title:default,uiState:(P-2:(vis:(legendOpen:!f)),P-3:(vis:(legendOpen:!f)),P-4:(vis:(legendOpen:!f))))
[09:32:59] * Amir1 curses [09:33:05] <_joe_> Amir1: lol [09:33:18] <_joe_> Amir1: yes, right now they start but they never complete [09:33:24] <_joe_> should we roll back? [09:33:35] <_joe_> which errors do you see? [09:33:43] http://bit.ly/2d2kFyh [09:33:53] Could not connect to server "rdb1005.eqiad.wmnet:6381" [09:34:09] Unable to connect to redis server rdb1007.eqiad.wmnet:6380. [09:34:12] and so on [09:34:24] <_joe_> that should just be a warning and it should retry [09:34:30] (PS1) Ema: upload storage: transition cp4005+4006 [puppet] - https://gerrit.wikimedia.org/r/312488 (https://phabricator.wikimedia.org/T145661) [09:34:31] (I hope logstash is not sensitive data) [09:34:32] (PS1) Ema: upload storage: transition cp4007+4013 [puppet] - https://gerrit.wikimedia.org/r/312489 (https://phabricator.wikimedia.org/T145661) [09:34:33] <_joe_> that's not even that common [09:34:34] (PS1) Ema: upload storage: finish ulsfo (cp4014+cp4015) [puppet] - https://gerrit.wikimedia.org/r/312490 (https://phabricator.wikimedia.org/T145661) [09:34:47] <_joe_> Amir1: that's a red herring [09:34:55] okay [09:35:03] <_joe_> something else is going on, you start thousands of jobs per hour [09:35:15] <_joe_> and there we just see a few errors [09:35:49] <_joe_> I mean I see 20 errors in the last hour, am I looking at the wrong query maybe?
[09:36:31] no, I think it's correct [09:36:42] 400 errors in the last 12 hours [09:38:34] (CR) Ema: [C: 2] upload storage: transition cp4005+4006 [puppet] - https://gerrit.wikimedia.org/r/312488 (https://phabricator.wikimedia.org/T145661) (owner: Ema) [09:38:54] I run a maintenance script to fill the data for now [09:39:05] <_joe_> Amir1: will that work? [09:39:11] <_joe_> I mean if it uses the php code [09:39:19] <_joe_> i expect that to fail as well [09:39:29] It doesn't work with jobs [09:39:52] and if something more upstream (like storing in the db) is broken I would know [09:41:04] <_joe_> Amir1: I found something [09:41:06] <_joe_> Sep 23 09:40:11 mw1300 jobrunner[15132]: 2016-09-23T09:40:11+0000 ERROR: Runner loop 0 process in slot 19 gave status '0': [09:41:10] <_joe_> Sep 23 09:40:11 mw1300 jobrunner[15132]: json_decode() error (4): Syntax error: @todo more info [09:41:15] _joe_: I found it too [09:41:25] ladsgroup@terbium:~$ mwscript extensions/ORES/maintenance/PopulateDatabase.php --wiki=enwiki [09:41:25] Processing 50 revisions [09:41:25] Catchable fatal error: Argument 2 passed to ORES\Cache::processRevision() must be an instance of ORES\int, integer given, called in /srv/mediawiki/php-1.28.0-wmf.20/extensions/ORES/includes/Cache.php on line 37 and defined in /srv/mediawiki/php-1.28.0-wmf.20/extensions/ORES/includes/Cache.php on line 116 [09:41:36] that explains it [09:42:09] <_joe_> ok I found it as well [09:42:11] <_joe_> heh, yes [09:42:22] <_joe_> hashar: ping [09:42:37] I'll fix it right now [09:42:40] make a patch [09:42:46] <_joe_> hashar: we want to either rollback from wmf20 or release a patch [09:42:47] backport to wmf.20 [09:43:05] <_joe_> Amir1: I'll be happy to take a look, but someone with more mediawiki expertise might be needed [09:43:07] no, I think we should backport to wmf.20 and deploy right now [09:43:16] <_joe_> Amir1: I concur [09:43:18] _joe_: yea [09:43:19] h [09:43:31] PROBLEM - puppet last run on cp4005 is CRITICAL:
CRITICAL: Puppet has 1 failures. Last run 49 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [09:43:44] <_joe_> ema: ^^ [09:43:57] the jobrunner json_decode() error has been going on for quite a while [09:43:57] <_joe_> I guess you already know, but one never knows [09:44:11] yeah that's me :) [09:44:27] the patches I sent earlier this week related to jobrunner/jobchron logging and the upgrade of jobrunner are related to it [09:44:31] previously we had nothing shown [09:44:32] and [09:45:08] MediaWiki /rpc/RunJobs.php returns the error in pretty format, e.g. with the whole skin rendered. So the job output is pretty much useless [09:45:16] but we should get an exception logged in logstash [09:45:25] ORES\int, integer given [09:45:41] (PS1) Marostegui: wmnet: Deleted db1010 entry [dns] - https://gerrit.wikimedia.org/r/312492 (https://phabricator.wikimedia.org/T129395) [09:45:45] Amir1: yeah if you get a patch for wmf.20, I can push it [09:45:51] https://phabricator.wikimedia.org/T146461 [09:46:10] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:46:24] (I am not going to rant about an ORES\int type when PHP has a built-in integer :D ) [09:47:11] more seriously, I think we have exactly zero monitoring / alarming regarding jobs [09:48:03] !log disabling alerts and shutting down db1010 in preparation for decommissioning T129395 [09:48:05] T129395: Decommission db1010 - https://phabricator.wikimedia.org/T129395 [09:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:49:22] (PS2) Marostegui: wmnet: Delete db1010 entry [dns] - https://gerrit.wikimedia.org/r/312492 (https://phabricator.wikimedia.org/T129395) [09:49:30] I will write extensive CI tests from now on for the ORES extension [09:49:42] (CR) Jcrespo: [C: 2] db1010: retire entry from dhcp install [puppet] - https://gerrit.wikimedia.org/r/312486
(https://phabricator.wikimedia.org/T129395) (owner: Jcrespo) [09:49:50] (PS2) Jcrespo: db1010: retire entry from dhcp install [puppet] - https://gerrit.wikimedia.org/r/312486 (https://phabricator.wikimedia.org/T129395) [09:50:33] hashar: the backport got merged: https://gerrit.wikimedia.org/r/#/c/312493/ [09:50:45] Do you want me to backport it or will you? [09:51:07] s/backport/deploy [09:53:38] Amir1: sorry was cleaning up some stuff in phabricator :D [09:53:45] I'll do it [09:53:47] :) [09:54:21] yeah let's push that [09:57:54] Amir1: rolling [09:58:07] !log rearmed keyholder on mira [09:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:58:20] hashar: I'm in tin [09:58:31] !log hashar@tin Synchronized php-1.28.0-wmf.20/extensions/ORES/includes/Cache.php: No int typehinting (causes jobs to crash) T146461 (duration: 00m 42s) [09:58:31] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 43 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [09:58:32] T146461: No ORES jobs are running since deployment of 1.28.0-wmf.20 - https://phabricator.wikimedia.org/T146461 [09:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:59:58] I was deploying, did git fetch and nothing showed up :D [10:00:09] I was confused [10:00:09] Amir1: done. So jobs should flow again [10:00:16] I did the git fetch :] [10:00:25] any clue how to verify?
[10:00:30] I need to run some maintenance script [10:00:36] that way it's possible [10:00:40] I looked at RunJobs.log on fluorine but couldn't tell about failures [10:00:55] !log ladsgroup@terbium:~$ mwscript extensions/ORES/maintenance/PopulateDatabase.php --wiki=enwiki [10:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:01:06] (PS1) Elukey: First draft of the Pivot UI's puppetization [puppet] - https://gerrit.wikimedia.org/r/312495 (https://phabricator.wikimedia.org/T138262) [10:01:10] It's working [10:01:10] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:01:18] hashar: https://grafana.wikimedia.org/dashboard/db/ores-extension [10:02:19] (CR) jenkins-bot: [V: -1] First draft of the Pivot UI's puppetization [puppet] - https://gerrit.wikimedia.org/r/312495 (https://phabricator.wikimedia.org/T138262) (owner: Elukey) [10:03:32] I was checking the recent changes in enwiki, no rows were highlighted, I was like "now what" and realized I'm checking enwiki in beta cluster [10:03:35] (PS1) Muehlenhoff: Drop explicit trusty installer config for mira [puppet] - https://gerrit.wikimedia.org/r/312496 [10:03:37] I almost had a stroke :D [10:03:49] but in enwiki it's okay [10:04:22] the impact is minimal isn't it? [10:04:27] besides ores score not showing?
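On the "any clue how to verify?" question: one option, besides the dashboards and the maintenance script, is to tally job outcomes per job type straight from the runner log. The following is a minimal sketch, not existing tooling. It assumes log lines shaped like the single sample quoted earlier in the channel (`INFO: ORESFetchScoreJob ... t=1021 good`); the `STARTING` and `error=` variants, the regular expression, and the name `tally_job_outcomes` are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumed line shape: "INFO: <JobType> <title and params> <outcome>"
# where <outcome> is "STARTING", "t=<ms> good" or "t=<ms> error=<text>".
# Only the "good" form appears in the log above; the other two are assumptions.
LINE_RE = re.compile(
    r"INFO: (?P<job>\S+) .*?"
    r"(?:(?P<start>STARTING)|t=\d+ (?P<good>good)|t=\d+ (?P<error>error=.*))$"
)


def tally_job_outcomes(lines):
    """Count started / good / error lines per job type; skip unparseable lines."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.search(line.rstrip())
        if match is None:
            continue
        if match.group("start"):
            outcome = "started"
        elif match.group("good"):
            outcome = "good"
        else:
            outcome = "error"
        counts[match.group("job")][outcome] += 1
    return counts


sample = [
    "INFO: ORESFetchScoreJob Foo revid=1 STARTING",
    "INFO: ORESFetchScoreJob Foo revid=1 t=1021 good",
    "INFO: ORESFetchScoreJob Bar revid=2 t=12 error=boom",
]
for job, outcomes in tally_job_outcomes(sample).items():
    print(job, dict(outcomes))
```

A job type with many `started` lines and no `good` lines would match the "they start but they never complete" symptom described earlier.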
[10:04:34] yeah [10:04:42] but I'll file an incident report ASAP [10:04:47] feel free to mark https://phabricator.wikimedia.org/T146461 resolved once done [10:05:12] you can refer to https://wikitech.wikimedia.org/wiki/Incident_documentation/20160915-MediaWiki from last week [10:05:26] which has a yet to be filled actionable of "Add monitoring alarm for global and type of jobs errors" [10:05:32] and Add monitoring alarm for Account creation errors (Task T146090) [10:05:32] T146090: High failure rate of account creation should trigger an alarm / page people - https://phabricator.wikimedia.org/T146090 [10:05:35] which would be pretty similar [10:05:36] !log ladsgroup@terbium:~$ mwscript extensions/ORES/maintenance/PopulateDatabase.php --wiki=wikidatawiki (T146461) and for 'trwiki', 'plwiki', 'fawiki', 'nlwiki', 'ruwiki', 'ptwiki' [10:05:37] T146461: No ORES jobs are running since deployment of 1.28.0-wmf.20 - https://phabricator.wikimedia.org/T146461 [10:05:40] (CR) Muehlenhoff: [C: 2] Drop explicit trusty installer config for mira [puppet] - https://gerrit.wikimedia.org/r/312496 (owner: Muehlenhoff) [10:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:05:54] awesome hashar, I'll use that [10:06:24] (CR) Ema: [C: 2] upload storage: transition cp4007+4013 [puppet] - https://gerrit.wikimedia.org/r/312489 (https://phabricator.wikimedia.org/T145661) (owner: Ema) [10:06:30] (PS2) Ema: upload storage: transition cp4007+4013 [puppet] - https://gerrit.wikimedia.org/r/312489 (https://phabricator.wikimedia.org/T145661) [10:06:33] (CR) Ema: [V: 2] upload storage: transition cp4007+4013 [puppet] - https://gerrit.wikimedia.org/r/312489 (https://phabricator.wikimedia.org/T145661) (owner: Ema) [10:06:59] (PS2) Elukey: First draft of the Pivot UI's puppetization [puppet] - https://gerrit.wikimedia.org/r/312495 (https://phabricator.wikimedia.org/T138262) [10:10:32] PROBLEM - puppet last run on mw1207 is
CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:11:38] Amir1: I think the key is the Jobrunner service does not report errors per job type (or I could not find the metric) [10:11:49] Amir1: AND we do not have Icinga checks for job failures (afaik) [10:12:02] so whenever some or all jobs screw up, we don't know [10:12:20] and only realize it after the fact, typically on our Friday morning, several hours after they broke [10:12:40] last week it was all account creations being impossible for ~ 15 hours :( [10:12:43] yeah, I see. but we should have first responders per job type, e.g. if ORESFetchScore is failing it should ping me [10:13:35] https://grafana.wikimedia.org/dashboard/db/job-queue-health [10:13:45] _joe_: job queue is back to normal now [10:14:23] 06Operations: hhvm root:adm owned log files cause failures for logrotate - https://phabricator.wikimedia.org/T146464#2661877 (10elukey) [10:14:33] 06Operations: hhvm root:adm owned log files cause failures for logrotate - https://phabricator.wikimedia.org/T146464#2661892 (10elukey) p:05Normal>03Low [10:15:12] Amir1: yeah any jobs having a huge failure rate should basically cause an IRC bot to scream here / send mails etc [10:16:05] note how we have roughly 120 jobs per minute failing [10:16:13] which has been going on for ages [10:16:19] !log reimaging mira to jessie (again, previously installer config still pointed to trusty) [10:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:16:49] Yeah [10:16:50] https://grafana.wikimedia.org/dashboard/db/ores-extension [10:16:52] and I don't know whether the failing jobs are discarded or retried later on [10:17:20] I monitor ores extension failure rate all the time, sometimes it's around 1%, sometimes it's 10% (when redis is crazy) [10:17:44] yeah we have trouble with redis connections timing out [10:17:48] most probably due to load [10:17:58] or spikes of connections overflowing
redis servers [10:18:05] hashar: depends, failed jobs have throttle and also a maximum number of failure (which is 30 by default) [10:19:18] 06Operations, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#2661909 (10hashar) [10:19:20] 06Operations: "Unable to connect to redis server" log spam - https://phabricator.wikimedia.org/T130078#2661911 (10hashar) [10:20:01] 06Operations, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#1996056 (10hashar) [10:20:16] 06Operations: "Unable to connect to redis server" log spam - https://phabricator.wikimedia.org/T130078#2124709 (10hashar) All content copy pasted to T125735 [10:23:07] <_joe_> Amir1: \o/ [10:23:19] (03PS3) 10Jcrespo: db1010: retire entry from dhcp install [puppet] - 10https://gerrit.wikimedia.org/r/312486 (https://phabricator.wikimedia.org/T129395) [10:23:19] <_joe_> sorry, was lost into writing documentation on-wiki [10:23:20] <_joe_> :P [10:23:39] :) [10:24:43] also total queue size is descending [10:34:04] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661922 (10jcrespo) Adding Traffic so they can give it a quick look. 
[10:37:01] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:43:15] 06Operations, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#2661944 (10hashar) [10:46:51] Amir1: good catch :] [10:57:05] thanks :) [11:05:43] (03PS2) 10Alexandros Kosiaris: puppetmaster/puppetdb: Make ferm rules better [puppet] - 10https://gerrit.wikimedia.org/r/312054 [11:08:29] 06Operations, 13Patch-For-Review: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#2318329 (10Volans) I think that this kind of task could be part of our automation //framework// (TBD). Having it managed from a centralized place will help with: * moni... [11:17:44] (03PS1) 10Urbanecm: Enable subpages in 121 namespace in wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312500 (https://phabricator.wikimedia.org/T146271) [11:17:57] (03CR) 10Mobrovac: [C: 031] role::deployment::server: fix scap3/trebuchet declarations [puppet] - 10https://gerrit.wikimedia.org/r/306440 (https://phabricator.wikimedia.org/T143692) (owner: 10Giuseppe Lavagetto) [11:19:03] (03CR) 10Giuseppe Lavagetto: [C: 031] "+1 but amend the change title: nothing about puppetDB here" [puppet] - 10https://gerrit.wikimedia.org/r/312054 (owner: 10Alexandros Kosiaris) [11:19:48] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661961 (10doctaxon) p:05High>03Unbreak! changed Priority because there have to run a lot of bot scripts Wikipedia users needs to work with it. The unbreak is open.... [11:25:18] 06Operations, 13Patch-For-Review: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#2661964 (10MoritzMuehlenhoff) @volans Ack, sounds good to me. 
[11:26:35] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661965 (10jcrespo) @doctaxon can you indicate the full url you are trying? [11:37:55] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661968 (10doctaxon) from chat about the topic: ``` 11:26 < wikibugs> Labs, Tool-Labs, Operations, Traffic: repeated 503 errors for 90 minutes now on... [11:40:23] (03CR) 10Mobrovac: [C: 031] scap: update to 3.3.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/312480 (https://phabricator.wikimedia.org/T127762) (owner: 10Filippo Giunchedi) [11:41:40] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661971 (10Steinsplitter) Getting the problem when accessing the Wikimedia Commons api via labs or labs grid engine. For example when attempting to getting image info... [11:43:23] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661551 (10Joe) @Steinsplitter do you get the data correctly if you try from your computer? [11:43:34] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661974 (10doctaxon) is the error related to the cache proxies, if there are reports of all the cp1065, cp 1053, cp 1055 ...? [11:44:43] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661975 (10Joe) >>! In T146451#2661974, @doctaxon wrote: > is the error related to the cache proxies, if there are reports of all the cp1065, cp 1053, cp 1055 ...? It... 
[11:47:34] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661977 (10doctaxon) Who is responsible for that? [11:52:41] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2661984 (10Steinsplitter) >>! In T146451#2661972, @Joe wrote: > @Steinsplitter do you get the data correctly if you try from your computer? Yes, ~ 40 successful attem... [11:53:53] (03PS2) 10Filippo Giunchedi: scap: update to 3.3.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/312480 (https://phabricator.wikimedia.org/T127762) [11:55:40] (03CR) 10Filippo Giunchedi: [C: 032] scap: update to 3.3.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/312480 (https://phabricator.wikimedia.org/T127762) (owner: 10Filippo Giunchedi) [11:56:09] moritzm mobrovac ^ [11:56:45] thnx godog! [12:00:38] np! [12:00:50] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: publish lag and response time for wdqs codfw to graphite - https://phabricator.wikimedia.org/T146207#2661993 (10Addshore) [12:00:56] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: publish lag and response time for wdqs codfw to graphite - https://phabricator.wikimedia.org/T146207#2653667 (10Addshore) [12:01:15] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: publish lag and response time for wdqs codfw to graphite - https://phabricator.wikimedia.org/T146207#2653667 (10Addshore) a:05Gehel>03Addshore [12:02:36] addshore: thanks! [12:02:53] no worries! I will file a ticket about moving them to diamond too! [12:03:34] <_joe_> !log rolling restart of mw1280-90, high cpu usage due to memory leaks. 
[12:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:03:53] <_joe_> that's hhvm on those servers, not rebooting the whole server, heh [12:06:03] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: wdqs - move metric collections to diamond - https://phabricator.wikimedia.org/T146468#2662014 (10Gehel) [12:06:19] addshore: I just created that ticket ^ [12:06:25] oh, haha! we just created dupes :) [12:06:30] I'll merge mine into yours! [12:06:34] addshore: damn, too fast! [12:06:46] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662029 (10doctaxon) next error trying this: https://de.wikipedia.org/w/index.php?title=Offshore-Windpark_Borssele&action=info ``` / format json / maxlag 5 / action q... [12:06:56] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: wdqs - move metric collections to diamond - https://phabricator.wikimedia.org/T146468#2662031 (10Addshore) [12:07:19] 06Operations, 06Discovery, 06WMDE-Analytics-Engineering, 10Wikidata, and 2 others: wdqs - move metric collections to diamond - https://phabricator.wikimedia.org/T146468#2662014 (10Addshore) [12:07:29] [= [12:08:03] =[ [12:08:17] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662035 (10Joe) @doctaxon do you get an error consistently for that url? if so, trying from where? I still can't reproduce your problem, that seems not to be limited t... [12:09:17] gehel: the updated script should be dpeloyed and start running with the next puppet run on stat1002 :) [12:09:19] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662036 (10doctaxon) no, it's not consistently but random, it's always API info up to now [12:09:27] addshore: thanks a lot! 
[12:20:31] !log rearmed keyholder on mira [12:24:42] 06Operations, 06Release-Engineering-Team, 07HHVM, 13Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2662046 (10MoritzMuehlenhoff) mira is now running jessie. Please give it some more testing, for migrating tin, we could mira temporarily make the... [12:26:43] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662047 (10doctaxon) runs good for 8 minutes now [12:28:28] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662048 (10ema) I've tried reproducing the issue for a while without success. @Joe restarted mw1280-90 due to memory leaks, perhaps that helped? [12:31:12] 06Operations, 10ops-esams, 10DNS, 10Traffic, 10netops: eeden ethernet outage - https://phabricator.wikimedia.org/T146391#2662061 (10faidon) >>! In T146391#2661380, @grin wrote: > sorry for chiming in. :-) No reason to be sorry — thanks for the input! [12:31:26] (03PS2) 10Ema: upload storage: finish ulsfo (cp4014+cp4015) [puppet] - 10https://gerrit.wikimedia.org/r/312490 (https://phabricator.wikimedia.org/T145661) [12:31:35] (03CR) 10Ema: [C: 032 V: 032] upload storage: finish ulsfo (cp4014+cp4015) [puppet] - 10https://gerrit.wikimedia.org/r/312490 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [12:32:33] (03PS1) 10Hashar: logstash: parse runJobs messages [puppet] - 10https://gerrit.wikimedia.org/r/312504 (https://phabricator.wikimedia.org/T146469) [12:33:25] (03CR) 10Hashar: "Bryan, I am not sure who else but you can review logstash filter and figure out how to properly test it. Maybe on beta? 
I am willing to " [puppet] - 10https://gerrit.wikimedia.org/r/312504 (https://phabricator.wikimedia.org/T146469) (owner: 10Hashar) [12:36:53] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 32 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [12:39:25] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:51:19] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662098 (10doctaxon) Okay, I suppose, the problem has been solved. What have you done to solve it? [12:52:23] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 39 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [12:53:47] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662101 (10ema) @doctaxon: nothing, except for @Joe's restart of the HHVMs mentioned above. [12:54:22] (03PS1) 10Jgreen: move frauth1001.frack.eqiad.wmnet to beryllium's IP [dns] - 10https://gerrit.wikimedia.org/r/312507 (https://phabricator.wikimedia.org/T145101) [12:54:56] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:56:06] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662109 (10Joe) @doctaxon I tracked down `mw1203` and `mw1280-1290` as potential source of problems because of how much cpu/RAM they were consuming, and issued a rollin... [12:58:21] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662110 (10doctaxon) Top! Thank you very much! 
[12:58:52] jouncebot: next [12:58:52] In 240 hour(s) and 1 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161003T1300) [12:59:16] ah [12:59:25] looks like Friday European SWAT is gone :] [13:00:07] hashar: was friday eu swat a thing? [13:00:43] (03CR) 10DCausse: elasticsearch tool (032 comments) [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/309573 (owner: 10Gehel) [13:01:03] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:02:43] (03CR) 10Gehel: elasticsearch tool (034 comments) [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/309573 (owner: 10Gehel) [13:04:10] addshore: I believe? [13:04:20] * addshore doesnt remember! [13:05:34] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2662114 (10Joe) 05Open>03Resolved a:03Joe [13:11:36] I thought there weren't deploys on Fridays. [13:15:26] (03CR) 10Jgreen: [C: 032] move frauth1001.frack.eqiad.wmnet to beryllium's IP [dns] - 10https://gerrit.wikimedia.org/r/312507 (https://phabricator.wikimedia.org/T145101) (owner: 10Jgreen) [13:17:53] (03PS10) 10Gehel: elasticsearch tool [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/309573 [13:20:28] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:20:42] 06Operations, 06Release-Engineering-Team, 07HHVM, 13Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2662117 (10Krenair) >>! In T144578#2662046, @MoritzMuehlenhoff wrote: > mira is now running jessie. Please give it some more testing The next de... 
[13:22:17] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 732 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4228336 keys - replication_delay is 732 [13:24:48] 06Operations, 10MediaWiki-JobRunner, 13Patch-For-Review: jobchron logs are not rotated - https://phabricator.wikimedia.org/T96132#2662123 (10hashar) 05Open>03Resolved I have confirmed both Trusty and Jessie properly logrotate both jobchron.log and jobrunner.log ``` mw1161$ ls -1 /var/log/mediawiki/*.log{... [13:25:27] 06Operations, 10MediaWiki-JobRunner, 07Beta-Cluster-reproducible, 13Patch-For-Review: wikidev people cant read /var/log/mediawiki/jobrunner.log - https://phabricator.wikimedia.org/T146040#2662136 (10hashar) 05Open>03Resolved I have confirmed both Trusty and Jessie properly logrotate both jobchron.log a... [13:26:47] RECOVERY - puppet last run on elastic1042 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [13:32:17] 06Operations, 10ops-eqiad, 10DBA: db1060: Degraded RAID - https://phabricator.wikimedia.org/T146449#2661404 (10Cmjohnson) Replaced the failed disk [13:34:40] (03PS1) 10BBlack: cache_upload: re-enable daily backend restarts [puppet] - 10https://gerrit.wikimedia.org/r/312510 (https://phabricator.wikimedia.org/T145661) [13:34:42] (03PS1) 10BBlack: cache_upload: storage experiment is the new normal [puppet] - 10https://gerrit.wikimedia.org/r/312511 (https://phabricator.wikimedia.org/T145661) [13:34:44] (03PS1) 10BBlack: cache_upload: removed unused hieradata key [puppet] - 10https://gerrit.wikimedia.org/r/312512 (https://phabricator.wikimedia.org/T145661) [13:35:02] (03PS3) 10Muehlenhoff: Create a new LDAP schema extension for custom user attributes [puppet] - 10https://gerrit.wikimedia.org/r/311694 (https://phabricator.wikimedia.org/T146102) [13:35:16] (03PS3) 10Alexandros Kosiaris: puppetmaster: Make ferm rules better [puppet] - 10https://gerrit.wikimedia.org/r/312054 [13:35:18] (03PS1) 10Alexandros 
Kosiaris: puppetdb: Only allow connection from puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/312513 [13:36:48] 06Operations, 10ops-eqiad, 10DBA: db1060: Degraded RAID - https://phabricator.wikimedia.org/T146449#2662173 (10Marostegui) Thanks - it is rebuilding now! ``` root@db1060:~# megacli -PDRbld -ShowProg -PhysDrv [32:4] -aALL Rebuild Progress on Device at Enclosure 32, Slot 4 Completed 2% in 5 Minutes. Exit C... [13:37:30] (03CR) 10Muehlenhoff: [C: 032] Create a new LDAP schema extension for custom user attributes [puppet] - 10https://gerrit.wikimedia.org/r/311694 (https://phabricator.wikimedia.org/T146102) (owner: 10Muehlenhoff) [13:37:44] (03CR) 10jenkins-bot: [V: 04-1] puppetdb: Only allow connection from puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/312513 (owner: 10Alexandros Kosiaris) [13:39:01] 06Operations, 10DBA: Multiple pages with no revisions - https://phabricator.wikimedia.org/T112282#2662175 (10Aklapper) Does anyone plan to fix those pages? Or is this task rather low priority? [13:40:00] gehel: just wondering are those wdqs nginxs already up and running? [13:40:39] addshore: they should be [13:41:09] hmm, okay, may not be accessible from stat1002 on port 8888! [13:41:44] gehel: I'm guessing I need to open another version of https://phabricator.wikimedia.org/T120010 to cover codfw! [13:42:09] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:42:36] addshore: yes, probably...
[13:42:50] (03CR) 10Ottomata: First draft of the Pivot UI's puppetization (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/312495 (https://phabricator.wikimedia.org/T138262) (owner: 10Elukey) [13:44:02] ^fixing pollux [13:44:07] (03PS1) 10Muehlenhoff: Fix passing extra_schemas [puppet] - 10https://gerrit.wikimedia.org/r/312515 [13:45:27] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [13:46:05] (03CR) 10Muehlenhoff: [C: 032] Fix passing extra_schemas [puppet] - 10https://gerrit.wikimedia.org/r/312515 (owner: 10Muehlenhoff) [13:46:46] 06Operations, 06Discovery, 06WMDE-Analytics-Engineering, 10Wikidata, and 3 others: Add firewall exception to get to wdqs*.codfw.wmnet:8888 from analytics cluster - https://phabricator.wikimedia.org/T146474#2662197 (10Addshore) [13:47:09] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [13:48:45] 06Operations, 06Discovery, 06WMDE-Analytics-Engineering, 10Wikidata, and 3 others: Add firewall exception to get to wdqs*.codfw.wmnet:8888 from analytics cluster - https://phabricator.wikimedia.org/T146474#2662197 (10Addshore) [13:48:50] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: publish lag and response time for wdqs codfw to graphite - https://phabricator.wikimedia.org/T146207#2662221 (10Addshore) [13:48:53] 06Operations, 10DBA: Multiple pages with no revisions - https://phabricator.wikimedia.org/T112282#2662225 (10jcrespo) > My opinion on this is that there could be data loss, but all of them in 2012 of before, it just happens that the page was "touched" recently. This makes this issue less of an imminent problem... [13:53:44] 06Operations, 10Traffic, 10Wikimedia-Blog, 07HTTPS: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2662242 (10Aklapper) >>! 
In T105905#2525620, @BBlack wrote: > The blog is still sending the response header: `strict-transport-security: max-age=86400`. It should be `strict-transport-... [13:56:17] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:56:18] PROBLEM - dhclient process on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:57:10] PROBLEM - salt-minion processes on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:01:22] 06Operations, 10Mail, 07LDAP, 13Patch-For-Review: Add yubikey attribute to production ldap - https://phabricator.wikimedia.org/T146102#2662286 (10MoritzMuehlenhoff) @bbogaert : https://gerrit.wikimedia.org/r/#/c/311694/ (and the followup fix https://gerrit.wikimedia.org/r/#/c/312515/) were just merged. If... [14:01:51] (03PS2) 10BBlack: cache_upload: re-enable daily backend restarts [puppet] - 10https://gerrit.wikimedia.org/r/312510 (https://phabricator.wikimedia.org/T145661) [14:01:58] (03PS1) 10Alexandros Kosiaris: Allow extra arguments to be passed to compile function [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/312516 [14:02:00] (03PS1) 10Alexandros Kosiaris: Enable future parser based on .future file [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/312517 [14:02:25] (03CR) 10BBlack: [C: 032 V: 032] cache_upload: re-enable daily backend restarts [puppet] - 10https://gerrit.wikimedia.org/r/312510 (https://phabricator.wikimedia.org/T145661) (owner: 10BBlack) [14:02:39] (03CR) 10jenkins-bot: [V: 04-1] Allow extra arguments to be passed to compile function [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/312516 (owner: 10Alexandros Kosiaris) [14:02:45] (03PS1) 10Hashar: contint: unmount /mnt/home/jenkins-deploy/tmpfs [puppet] - 10https://gerrit.wikimedia.org/r/312518 (https://phabricator.wikimedia.org/T146381) [14:03:09] (03CR) 10jenkins-bot: [V: 04-1] Enable future parser based on .future file [software/puppet-compiler] - 
10https://gerrit.wikimedia.org/r/312517 (owner: 10Alexandros Kosiaris) [14:03:27] (03PS2) 10BBlack: cache_upload: storage experiment is the new normal [puppet] - 10https://gerrit.wikimedia.org/r/312511 (https://phabricator.wikimedia.org/T145661) [14:03:32] (03CR) 10BBlack: [C: 032 V: 032] cache_upload: storage experiment is the new normal [puppet] - 10https://gerrit.wikimedia.org/r/312511 (https://phabricator.wikimedia.org/T145661) (owner: 10BBlack) [14:04:35] (03CR) 10Alexandros Kosiaris: [C: 032] puppetmaster: Make ferm rules better [puppet] - 10https://gerrit.wikimedia.org/r/312054 (owner: 10Alexandros Kosiaris) [14:04:40] (03PS4) 10Alexandros Kosiaris: puppetmaster: Make ferm rules better [puppet] - 10https://gerrit.wikimedia.org/r/312054 [14:04:45] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetmaster: Make ferm rules better [puppet] - 10https://gerrit.wikimedia.org/r/312054 (owner: 10Alexandros Kosiaris) [14:09:55] (03PS1) 10Hashar: contint: move to ::srv role from ::mnt [puppet] - 10https://gerrit.wikimedia.org/r/312519 (https://phabricator.wikimedia.org/T146381) [14:10:06] (03PS1) 10Alexandros Kosiaris: puppetmaster: Include base::firewall on frontends [puppet] - 10https://gerrit.wikimedia.org/r/312520 [14:10:27] (03PS2) 10Alexandros Kosiaris: puppetmaster: Include base::firewall on frontends [puppet] - 10https://gerrit.wikimedia.org/r/312520 [14:10:29] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:10:30] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetmaster: Include base::firewall on frontends [puppet] - 10https://gerrit.wikimedia.org/r/312520 (owner: 10Alexandros Kosiaris) [14:11:56] (03PS2) 10BBlack: cache_upload: removed unused hieradata key [puppet] - 10https://gerrit.wikimedia.org/r/312512 (https://phabricator.wikimedia.org/T145661) [14:12:02] (03CR) 10BBlack: [C: 032 V: 032] cache_upload: removed unused hieradata key [puppet] - 10https://gerrit.wikimedia.org/r/312512 (https://phabricator.wikimedia.org/T145661) (owner: 10BBlack) [14:17:09] PROBLEM - puppet last run on mc2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:18:38] PROBLEM - puppet last run on mw2100 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:19:39] (03PS1) 10Muehlenhoff: Decomission mw1217 [puppet] - 10https://gerrit.wikimedia.org/r/312522 (https://phabricator.wikimedia.org/T138925) [14:24:38] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:24:47] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup beryllium replacement frauth1001 - https://phabricator.wikimedia.org/T143902#2662390 (10Jgreen) 05Open>03Resolved [14:26:10] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2662393 (10GWicke) [14:27:48] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:29:17] (03PS1) 10Hashar: contint: move from /mnt to /srv [puppet] - 10https://gerrit.wikimedia.org/r/312523 (https://phabricator.wikimedia.org/T146381) [14:29:38] RECOVERY - puppet last run on mw2231 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [14:30:54] (03PS2) 10Hashar: contint: move from /mnt to /srv [puppet] - 10https://gerrit.wikimedia.org/r/312523 (https://phabricator.wikimedia.org/T146381) [14:36:40] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:40:00] (03PS3) 10Jcrespo: wmnet: Delete db1010 entry [dns] - 10https://gerrit.wikimedia.org/r/312492 (https://phabricator.wikimedia.org/T129395) (owner: 10Marostegui) [14:41:56] (03Abandoned) 10Hashar: contint: remove obsolete files { ensure => absent } [puppet] - 10https://gerrit.wikimedia.org/r/312275 (https://phabricator.wikimedia.org/T146381) (owner: 10Hashar) [14:41:58] (03Abandoned) 10Hashar: contint: migrate castor server to /srv [puppet] - 10https://gerrit.wikimedia.org/r/312322 (https://phabricator.wikimedia.org/T146381) (owner: 10Hashar) [14:42:00] (03Abandoned) 10Hashar: contint: mount /srv the same as /mnt [puppet] - 10https://gerrit.wikimedia.org/r/312262 (https://phabricator.wikimedia.org/T146381) (owner: 10Hashar) [14:42:04] (03Abandoned) 10Hashar: contint: add a tmpfs on /srv 
[puppet] - 10https://gerrit.wikimedia.org/r/312328 (https://phabricator.wikimedia.org/T146381) (owner: 10Hashar) [14:42:06] (03Abandoned) 10Hashar: contint: migrate browsertest redis to /srv [puppet] - 10https://gerrit.wikimedia.org/r/312267 (https://phabricator.wikimedia.org/T146381) (owner: 10Hashar) [14:42:09] (03Abandoned) 10Hashar: contint: create /srv based directory hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/312313 (https://phabricator.wikimedia.org/T146381) (owner: 10Hashar) [14:42:12] (03Abandoned) 10Hashar: contint: migrate package_builder from /mnt to /srv [puppet] - 10https://gerrit.wikimedia.org/r/312270 (https://phabricator.wikimedia.org/T146381) (owner: 10Hashar) [14:42:14] (03Abandoned) 10Hashar: contint: unmount /mnt/home/jenkins-deploy/tmpfs [puppet] - 10https://gerrit.wikimedia.org/r/312518 (https://phabricator.wikimedia.org/T146381) (owner: 10Hashar) [14:42:17] (03Abandoned) 10Hashar: contint: move to ::srv role from ::mnt [puppet] - 10https://gerrit.wikimedia.org/r/312519 (https://phabricator.wikimedia.org/T146381) (owner: 10Hashar) [14:43:19] RECOVERY - puppet last run on mc2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:43:49] (03CR) 10Hashar: "This was a quick rush for the beta cluster so we can add a deployment server as a jenkins slave. 
The reason was a conflict between /mnt a" [puppet] - 10https://gerrit.wikimedia.org/r/311959 (owner: 10Hashar) [14:44:09] RECOVERY - puppet last run on mw2100 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:47:12] (03PS3) 10Hashar: contint: move from /mnt to /srv [puppet] - 10https://gerrit.wikimedia.org/r/312523 (https://phabricator.wikimedia.org/T146381) [14:50:55] (03CR) 10Jcrespo: [C: 032] wmnet: Delete db1010 entry [dns] - 10https://gerrit.wikimedia.org/r/312492 (https://phabricator.wikimedia.org/T129395) (owner: 10Marostegui) [14:55:39] !log deployed dns update (removing db1010) T129395 [14:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:55:46] T129395: Decommission db1010 - https://phabricator.wikimedia.org/T129395 [15:02:31] 06Operations, 10ops-eqiad: Decommission labsdb1002 - https://phabricator.wikimedia.org/T146455#2662485 (10jcrespo) p:05Triage>03Low [15:03:00] (03PS1) 10Muehlenhoff: udp2log: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/312525 [15:04:37] 06Operations, 10ops-eqiad: Decommission db1010 - https://phabricator.wikimedia.org/T129395#2662488 (10jcrespo) [15:05:00] 06Operations, 10Traffic, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2662491 (10BBlack) nginx has added `max_conns` to the open source master branch in http://hg.nginx.org/nginx/rev/29bf0dbc0a77 , which should appear in 1.11.5. That wasn... [15:06:46] 06Operations, 10ops-eqiad: Decommission db1010 - https://phabricator.wikimedia.org/T129395#2662493 (10jcrespo) a:05jcrespo>03None [15:08:47] 06Operations, 10DBA: db1034 lag - https://phabricator.wikimedia.org/T139280#2662497 (10jcrespo) This was not hardware related. However, it is on the list of soon-to decom servers. Stealing it for now. 
[15:09:25] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2662499 (10jcrespo) [15:09:54] 06Operations, 10DBA: db1034 lag - https://phabricator.wikimedia.org/T139280#2662501 (10jcrespo) [15:09:56] 06Operations, 10ops-eqiad: Decommission db1010 - https://phabricator.wikimedia.org/T129395#2662502 (10jcrespo) [15:10:00] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2266762 (10jcrespo) [15:10:16] 06Operations, 10DBA: db1034 decommission - https://phabricator.wikimedia.org/T139280#2425354 (10jcrespo) [15:11:06] (03CR) 10Anomie: [C: 031] Add 'message-format' log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312404 (https://phabricator.wikimedia.org/T146416) (owner: 10Gergő Tisza) [15:14:38] (03PS1) 10Muehlenhoff: Tools proxy: Restrict to labs networks [puppet] - 10https://gerrit.wikimedia.org/r/312527 [15:19:16] (03PS11) 10Gehel: elasticsearch tool [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/309573 [15:26:58] (03PS2) 10Alexandros Kosiaris: Enable future parser based on .future file [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/312517 [15:27:00] (03PS2) 10Alexandros Kosiaris: Allow extra arguments to be passed to compile function [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/312516 [15:27:26] (03CR) 10jenkins-bot: [V: 04-1] Enable future parser based on .future file [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/312517 (owner: 10Alexandros Kosiaris) [15:27:35] (03CR) 10jenkins-bot: [V: 04-1] Allow extra arguments to be passed to compile function [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/312516 (owner: 10Alexandros Kosiaris) [15:30:02] (03PS1) 10Jcrespo: labsdb1002: remove from dhcp install server config [puppet] - 10https://gerrit.wikimedia.org/r/312528 (https://phabricator.wikimedia.org/T146455) [15:32:30] 
(03PS2) 10Alexandros Kosiaris: puppetdb: Only allow connection from puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/312513 [15:32:56] (03CR) 10Gehel: elasticsearch tool (031 comment) [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/309573 (owner: 10Gehel) [15:33:52] 06Operations: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2662562 (10RobH) [15:33:56] (03PS3) 10Alexandros Kosiaris: Enable future parser based on .future file [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/312517 [15:34:30] (03CR) 10jenkins-bot: [V: 04-1] Enable future parser based on .future file [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/312517 (owner: 10Alexandros Kosiaris) [15:34:39] PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:36:20] (03CR) 10BryanDavis: "> Bryan, I am not sure who else but you can review logstash filter" [puppet] - 10https://gerrit.wikimedia.org/r/312504 (https://phabricator.wikimedia.org/T146469) (owner: 10Hashar) [15:37:43] (03PS4) 10Alexandros Kosiaris: Enable future parser based on .future file [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/312517 [15:37:45] (03PS3) 10Alexandros Kosiaris: Allow extra arguments to be passed to compile function [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/312516 [15:38:19] (03CR) 10jenkins-bot: [V: 04-1] Enable future parser based on .future file [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/312517 (owner: 10Alexandros Kosiaris) [15:42:22] (03CR) 10BryanDavis: "It might be a better long term solution to adjust the use of the PSR3 logger in JobRunner.php to actually generate structured log messages" [puppet] - 10https://gerrit.wikimedia.org/r/312504 (https://phabricator.wikimedia.org/T146469) (owner: 10Hashar) [15:44:13] (03PS5) 10Alexandros Kosiaris: Enable future parser based on 
.future file [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/312517 [15:49:52] (03PS2) 10Alexandros Kosiaris: tcpircbot: update comment detailing each IP in the ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/312454 (owner: 10Alex Monk) [15:49:55] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] tcpircbot: update comment detailing each IP in the ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/312454 (owner: 10Alex Monk) [15:50:15] (03PS2) 10Alexandros Kosiaris: tcpircbot: Follow-up Ide89c59f: Update ferm rules too [puppet] - 10https://gerrit.wikimedia.org/r/312455 (owner: 10Alex Monk) [15:50:17] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] tcpircbot: Follow-up Ide89c59f: Update ferm rules too [puppet] - 10https://gerrit.wikimedia.org/r/312455 (owner: 10Alex Monk) [15:50:41] (03CR) 10Alexandros Kosiaris: [C: 032] tcpircbot: remove localhost from ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/312456 (owner: 10Alex Monk) [15:50:44] (03PS2) 10Alexandros Kosiaris: tcpircbot: remove localhost from ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/312456 (owner: 10Alex Monk) [15:50:47] (03CR) 10Alexandros Kosiaris: [V: 032] tcpircbot: remove localhost from ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/312456 (owner: 10Alex Monk) [15:51:06] (03PS2) 10Alexandros Kosiaris: tcpircbot: Follow-up Ide89c59f: Fix missing CIDR prefix on puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/312457 (owner: 10Alex Monk) [15:51:10] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] tcpircbot: Follow-up Ide89c59f: Fix missing CIDR prefix on puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/312457 (owner: 10Alex Monk) [15:52:41] 06Operations, 10Traffic, 10Wikimedia-Blog, 07HTTPS: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2662606 (10Tbayer) >>! In T105905#2463813, @Tbayer wrote: >>>! In T105905#2463476, @faidon wrote: >> After my mail ysterday, Jeff Elder contacted me for clarifications (which I gave).... 
[15:53:18] 06Operations, 10Traffic, 10Wikimedia-Blog, 07HTTPS: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2662607 (10Tbayer) a:05Tbayer>03None [15:53:41] (03CR) 10Ottomata: [C: 031] udp2log: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/312525 (owner: 10Muehlenhoff) [15:55:32] PROBLEM - puppet last run on ms-be1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:59:26] (03CR) 10Giuseppe Lavagetto: [C: 031] "Small comment, see what you think." (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/312516 (owner: 10Alexandros Kosiaris) [15:59:51] RECOVERY - puppet last run on db1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:47] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Enable future parser based on .future file (032 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/312517 (owner: 10Alexandros Kosiaris) [16:04:40] (03PS12) 10Gehel: elasticsearch tool [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/309573 [16:12:15] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4191675 keys - replication_delay is 0 [16:17:31] 06Operations, 10Traffic, 10Wikimedia-Blog, 07HTTPS: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2662678 (10BBlack) [16:21:01] RECOVERY - puppet last run on ms-be1009 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:24:01] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:31:05] ^ that's real, some emerging pattern of 500s [16:31:26] wasn't there some mw appserver issues the last day or two? 
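Note on the 5xx alert above: the PROBLEM text ("10.00% of data above the critical threshold [1000.0]") and the later RECOVERY text ("Less than 1.00% above the threshold [250.0]") suggest a check that counts what share of recent datapoints exceed a threshold. A minimal sketch of that idea follows. It assumes 250 is the warning level, 1000 the critical level, and 1% the cutoff; the real check's logic is not shown in this log, and the function names are hypothetical.

```python
from typing import Sequence


def percent_above(values: Sequence[float], threshold: float) -> float:
    """Return the percentage of datapoints strictly above threshold."""
    if not values:
        return 0.0
    over = sum(1 for v in values if v > threshold)
    return 100.0 * over / len(values)


def classify(values: Sequence[float],
             warning: float = 250.0,    # assumed from the RECOVERY message
             critical: float = 1000.0,  # assumed from the PROBLEM message
             max_percent: float = 1.0) -> str:
    """Map a window of requests-per-minute samples to an alert state.

    Assumption: the state flips once at least max_percent of the
    samples in the window sit above the given threshold.
    """
    if percent_above(values, critical) >= max_percent:
        return "CRITICAL"
    if percent_above(values, warning) >= max_percent:
        return "WARNING"
    return "OK"
```

With ten samples of which one is above 1000, `percent_above` gives 10.0, which matches the "10.00% of data" wording in the alert.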
[16:32:12] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:33:30] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:34:28] only thing interesting/frequent in fatal log: https://logstash.wikimedia.org/goto/7ad0830e77538f1e0eed22fd64aa5dbe [16:41:30] PROBLEM - puppet last run on stat1002 is CRITICAL: Connection refused by host [16:41:32] PROBLEM - DPKG on stat1002 is CRITICAL: Connection refused by host [16:42:00] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:42:29] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:42:49] PROBLEM - configured eth on stat1002 is CRITICAL: Connection refused by host [16:43:10] PROBLEM - MegaRAID on stat1002 is CRITICAL: Connection refused by host [16:43:43] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:58:04] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 658 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4196552 keys - replication_delay is 658 [17:03:15] !log stat1002 - starting nagios-nrpe-server [17:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:03:23] RECOVERY - configured eth on stat1002 is OK: OK - interfaces up [17:03:44] RECOVERY - MegaRAID on stat1002 is OK: OK: optimal, 1 logical, 12 physical [17:04:23] !log stat1002 - before it was hanging and then fixed due to https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#Fixing_HDFS_mount_at_.2Fmnt.2Fhdfs [17:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:04:39] RECOVERY - DPKG on stat1002 is OK: All packages OK [17:04:41] RECOVERY - dhclient process on stat1002 is OK: 
PROCS OK: 0 processes with command name dhclient [17:05:21] RECOVERY - Disk space on stat1002 is OK: DISK OK [17:05:21] RECOVERY - salt-minion processes on stat1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:20:31] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [17:20:51] (03PS1) 10Dereckson: Update The Ash Tree to fr. and en.planet [puppet] - 10https://gerrit.wikimedia.org/r/312536 [17:22:02] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:23:01] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [17:23:36] mira has Improperly owned (0:0) files in /srv/mediawiki-staging again [17:24:01] oh, i guess the reinstall [17:24:19] what is this error talking about? https://logstash.wikimedia.org/goto/7ad0830e77538f1e0eed22fd64aa5dbe [17:24:32] probably what caused the alert on fatals [17:26:47] Hello. This is https://phabricator.wikimedia.org/T138036 [17:27:17] i got Courier Fetch Error: unhandled courier request error: Service Unavailable on that link [17:31:15] !log ebernhardson@tin Synchronized php-1.28.0-wmf.20/extensions/CirrusSearch/includes/CompletionSuggester.php: Add timing marks to narrow down autocomplete timing regression (duration: 18m 43s) [17:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:33:08] (03PS1) 10Catrope: Enable Flow beta feature on elwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312537 (https://phabricator.wikimedia.org/T144384) [17:34:26] did somebody manually fix mira or did it fix itself just now? [17:34:34] doesnt see the root-owned files anymore now [17:34:50] waits for a recovery though .. [17:35:05] it already did recover.. hello icinga-wm ? 
[17:35:12] !log ebernhardson@tin Synchronized php-1.28.0-wmf.20/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php: Add timing marks to narrow down autocomplete timing regression (duration: 00m 50s) [17:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:36:09] mutante: i just synced out a patch with scap, which could have done something to trigger recovery? [17:36:13] so I'm guessing the sync-file which triggered a co-master sync changed a lot about /srv/mediawiki-staging on mira [17:36:30] interesting. ok, that explains it at least [17:36:59] but icinga-wm could have told me [17:38:01] so, the only problem with mira currently is that the l10nupdate user has a uid of 1001, whereas on tin l10nupdate's uid is 10002 [17:38:29] which is causing some crazy output from scap about cdb rebuild permission denied. [17:39:30] 10002 is correct [17:39:38] seems like we fixed this before the reinstall too? [17:39:47] or similar [17:40:12] yeah, dunno how the new mira got 1001 for l10nupdate. 
[17:40:26] i can fix it [17:41:16] the GID is 10002 as it should be [17:41:59] yup, looks like it should be fine now :) [17:42:16] hold on [17:42:27] gotta run a find -exec [17:43:16] !log mira - changing UID of l10nupdate to 10002, chown'ing files (1001 -> 10002) [17:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:43:31] thcipriani: now..should be done [17:43:48] cool, I'm going to try a sync-file on a README and see what happens [17:44:36] 06Operations: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2662861 (10RobH) [17:46:10] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [17:46:35] !log thcipriani@tin Synchronized README: Test sync for new mira (duration: 01m 27s) [17:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:46:56] no errors in scap output this time. [17:47:07] thanks mutante ! [17:49:04] yw :) [17:53:57] (03CR) 10Dzahn: [C: 032] Update The Ash Tree to fr. and en.planet [puppet] - 10https://gerrit.wikimedia.org/r/312536 (owner: 10Dereckson) [17:54:35] 06Operations: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2662915 (10RobH) [17:55:16] 06Operations: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2652695 (10RobH) a:05RobH>03Joe These are ready for testing use. Please update/resolve this task accordingly once you are aware. Thanks! [17:56:09] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:56:39] mutante: don't forget to restart cron after changing the uid for l10nupdate [17:57:17] bd808: ! oh, right. thanks for the reminder, done! 
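Note on the UID fix above: the log only says that l10nupdate's UID was changed to 10002 and that files were chown'ed (1001 -> 10002) with a find -exec. The exact commands are not recorded. The sketch below shows the same re-owning step in Python, purely as an illustration; the function names are hypothetical, and changing the account's UID itself is a separate step that this code does not perform.

```python
import os
from typing import Iterator, List


def files_owned_by(root: str, uid: int) -> Iterator[str]:
    """Yield root and every path below it whose owner is uid."""
    if os.lstat(root).st_uid == uid:
        yield root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat inspects a symlink itself rather than its target
            if os.lstat(path).st_uid == uid:
                yield path


def reown(root: str, old_uid: int, new_uid: int,
          dry_run: bool = True) -> List[str]:
    """Hand everything owned by old_uid over to new_uid.

    The group is left untouched. With dry_run=True nothing is changed
    and the function only reports what it would touch.
    """
    # Collect first so ownership changes cannot affect the walk
    targets = list(files_owned_by(root, old_uid))
    if not dry_run:
        for path in targets:
            # gid -1 keeps the current group; do not follow symlinks
            os.chown(path, new_uid, -1, follow_symlinks=False)
    return targets
```

Run as root, something like `reown("/srv/mediawiki-staging", 1001, 10002, dry_run=False)` would correspond to the chown described in the log; the path is taken from the earlier discussion and whether it was the only tree affected is an assumption. As bd808 points out, cron needs a restart after the UID change.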
[17:57:23] !log mira restarted cron [17:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:58:13] mutante: yw. I think we only got burned by that like 3 times before :) [17:58:20] :p yes [17:58:32] brion: your blog is down [18:00:01] @seen hexmode [18:00:01] mutante: Last time I saw hexmode they were joining the channel, they are still in the channel #wmhack at 9/22/2016 9:39:21 PM (20h20m40s ago) [18:02:50] heh [18:02:55] lemme check [18:03:11] mutante: looks ok to me, what's wrong with it that you see? [18:05:00] brion: i saw planet.runner:Error 503 while updating feed http://leuksman.com/log/category/wiki/feed/ [18:05:14] brion: and then i tried opening leuksman.com in browser and it doesnt load at all [18:05:22] mutante: i think leuksman.com's been dead for some months at least, i moved everything to brionv.com [18:05:37] brion: oh, then let's update that in planet config [18:05:40] whee [18:06:06] uses https://brionv.com/log/feed/ [18:07:34] (03CR) 10Hashar: "I fully agree on going with adjusting MediaWiki to properly format from the source instead of print --> syslog -> grok :D I was a bit la" [puppet] - 10https://gerrit.wikimedia.org/r/312504 (https://phabricator.wikimedia.org/T146469) (owner: 10Hashar) [18:08:31] (03PS1) 10Dzahn: planet: update some moved feed URLs [puppet] - 10https://gerrit.wikimedia.org/r/312540 [18:09:27] (03PS2) 10Dzahn: planet: update some moved feed URLs [puppet] - 10https://gerrit.wikimedia.org/r/312540 [18:09:44] (03CR) 10Dzahn: [C: 032] planet: update some moved feed URLs [puppet] - 10https://gerrit.wikimedia.org/r/312540 (owner: 10Dzahn) [18:10:23] anyone familiar with api application servers load balancing? 
i'm investigating a large rise in p95 time for an api call, i happened to notice that mw1283-90 in ganglia report very high cpu usage compared to the rest of them: https://ganglia.wikimedia.org/latest/?r=2hr&cs=&ce=&c=API+application+servers+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [18:10:44] i'm not familiar enough with load balancing there to know if the load should be spread more evenly, or if those servers are pinned to some specific request types [18:21:08] mutante: thx [18:21:30] brion: welcome, you should now appear on en.planet.wm.org again [18:21:40] woot [18:22:18] oh, and i just realized it shows a little red line under affected blogs.. i never noticed :p [18:22:36] nice! [18:25:37] (03Abandoned) 10Dduvall: beta: Create and mount LVM volumes for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/305668 (https://phabricator.wikimedia.org/T138778) (owner: 10Dduvall) [18:26:01] (03Abandoned) 10Dduvall: beta: Install MariaDB 10 [puppet] - 10https://gerrit.wikimedia.org/r/310360 (https://phabricator.wikimedia.org/T138778) (owner: 10Dduvall) [18:41:08] !log killing stuck tilerator notification processes on maps1001 - T145534 [18:41:09] T145534: maps - tilerator notification seems stuck on sorting files - https://phabricator.wikimedia.org/T145534 [18:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:44:03] RECOVERY - MegaRAID on db1060 is OK: OK: optimal, 1 logical, 2 physical [18:58:02] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:03:45] (03Abandoned) 10Thcipriani: Scap: Bump installed version to 3.3.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/312437 (owner: 10Thcipriani) [19:18:05] 06Operations, 10Mail, 07LDAP, 13Patch-For-Review: Add yubikey attribute to production ldap - https://phabricator.wikimedia.org/T146102#2663215 (10bbogaert) Awesome. 
Thank you for the help @MoritzMuehlenhoff . I'll make the changes and let you know if there's any trouble. [19:23:12] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:46:31] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2663252 (10AndyRussG) @awight @Pcoombe Just to note, Special:BannerLoader calls always go to metawiki,... [19:48:49] ottomata: around? Your patch for kasocki triggers job that never ends :( https://phabricator.wikimedia.org/D385 [19:49:04] example https://integration.wikimedia.org/ci/job/phabricator-jessie-diffs/150/console [20:02:49] hashar: doh! i dunno why it does that, sometimes it works, sometimes it doesn't [20:03:04] thanks for killing [20:06:36] ottomata: I have made the job to die after 30 minutes :D [20:12:46] (03PS1) 1020after4: Make aptly http_port configurable [puppet] - 10https://gerrit.wikimedia.org/r/312562 [20:17:34] (03PS2) 1020after4: Make aptly http_port configurable [puppet] - 10https://gerrit.wikimedia.org/r/312562 [20:19:56] (03CR) 10jenkins-bot: [V: 04-1] Make aptly http_port configurable [puppet] - 10https://gerrit.wikimedia.org/r/312562 (owner: 1020after4) [20:22:13] bblack: hello! we have a mediawiki change that is going to change a big number of requests that are now returning 200 to 404, see if you approve: https://gerrit.wikimedia.org/r/#/c/312561/1 We get gazillion spamy requests this way so I *think* we might see the effect of this change at the varnishlayer [20:25:47] (03PS3) 1020after4: Make aptly http_port configurable [puppet] - 10https://gerrit.wikimedia.org/r/312562 (https://phabricator.wikimedia.org/T146497) [20:26:23] nuria_: given ops are not around next week, I guess we wanna postpone that change. 
I have cr-2 it [20:27:07] hashar: ya, no rush i have already pinged brandon about it cause I think we will see it on varnish metrics [20:27:23] nuria_: then that is for non existent pages so.. Probably not a big deal [20:27:23] hashar: do note that your -2 is due to ops being out plizzz [20:27:26] nuria_: I think you should make RoanKattouw's change too based on other 404 sent in MW :) [20:27:40] Hm? [20:27:42] What change? [20:28:01] https://gerrit.wikimedia.org/r/#/c/312561/ [20:28:11] It's saper's change, not mine [20:28:14] Your comment about $wgSend404Code [20:28:16] But yeah we can just merge that, right? [20:28:22] Regardless of ops being out etc [20:28:30] nop [20:28:32] It'll just ride the train the week after next [20:28:34] Why not? [20:29:00] I would like the green / go from cache people [20:29:04] It's in mediawiki/core, not config or anything [20:29:06] cause really I have zero idea of the impact [20:29:18] OK, that makes some sense [20:29:24] jaja [20:29:34] but I definitely like the idea of setting proper response code [20:29:39] ya, i already ping-ed brandon [20:30:18] but from their perspective is like, great, less resources for spam [20:30:23] and there is the crazy cdnupdate job that hit history [20:31:53] I can follow up with ops, since they are all in europe next week hashar might be able to contact them more easily so either or [20:32:01] so it is -2 on principle / per precaution [20:32:09] na they will be busy [20:32:16] I dont expect them to be online [20:32:21] k [20:32:30] good you noticed that one nuria [20:32:50] maybe I am just being entirely paranoid anyway [20:33:07] Isn't it still vaguely brandons working day still? ;) [20:33:24] define "working day"?
:D [20:33:40] "he's usually about at this time" [20:33:44] yeah [20:33:51] well that patch will land for sure [20:34:01] well, the -1 needs fixing first ;) [20:34:59] Reedy: It's fixed already [20:35:05] You're probably looking at PS1 [20:35:08] yeah [20:35:11] (This is very confusing in the new Gerrit UI) [20:35:15] I pinged brandom on ops [20:35:19] *brandon [20:35:19] fsck gerrit [20:35:32] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4183010 keys - replication_delay is 0 [20:36:25] Lol [20:36:27] That is [20:36:30] fixed in gerrit 2.13 [20:36:41] should be able to tell you which patch is current and not [20:36:57] by showing orange if it is an old patch and showing the normal colour if on latest patch [20:37:06] RoanKattouw ^^ [20:37:17] Oh nice [20:37:47] Right now it does say "Not Current" in bold text, but that's not very useful because there's always bold text in that location, it just usually says "Needs Code-Review" or "Merged" or something [20:37:56] Yep [20:37:58] Plus [20:38:03] Gerrit 2.13 has been released [20:38:11] yesterday and a 2.13.1 release today [20:38:29] sudo su - [20:38:34] puppet agent -tv --debug [20:38:40] ...
wrong term [20:39:04] RoanKattouw http://gerrit-new.wmflabs.org/#/c/2/1 and http://gerrit-new.wmflabs.org/#/c/2/2 [20:39:23] Nice [20:39:28] I see, the Patch set thing becomes orange [20:39:42] Yep [20:40:09] RoanKattouw i also contributed a few fixes to polygerrit, to get it working on Internet Explorer :) [20:40:27] it also includes the new unified diff too [20:42:52] (03PS4) 1020after4: Make aptly http_port configurable [puppet] - 10https://gerrit.wikimedia.org/r/312562 (https://phabricator.wikimedia.org/T146497) [20:43:19] You can also search the docs in gerrit now [20:43:33] (03CR) 1020after4: Make aptly http_port configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/312562 (https://phabricator.wikimedia.org/T146497) (owner: 1020after4) [20:50:24] (03PS1) 10Hashar: puppet_compiler: conftool settings are now class parameters [puppet] - 10https://gerrit.wikimedia.org/r/312600 [20:51:15] (03CR) 10Hashar: "Puppet fails:" [puppet] - 10https://gerrit.wikimedia.org/r/312600 (owner: 10Hashar) [20:52:11] !log cleaning up leftover system unit files on wdqs1* [20:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:52:38] I don't always have a well-defined "working day", but today in particular some of us are already traveling and the rest of us are tying up local loose ends in our lives I suspect :) [20:53:01] but yes, we're trying to avoid needing to respond to non-emergencies for the next week. sorry, once a year thing :) [20:56:00] (03CR) 10Alex Monk: [C: 031] "oops" [puppet] - 10https://gerrit.wikimedia.org/r/312600 (owner: 10Hashar) [20:56:13] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:07] (03CR) 1020after4: "turns out that even installing nginx doesn't work with apache installed because the default config uses port 80... 
grr" [puppet] - 10https://gerrit.wikimedia.org/r/312562 (https://phabricator.wikimedia.org/T146497) (owner: 1020after4) [21:04:47] twentyafterfour: there are ways around that, we have a similar issue with co-installing varnish+nginx (both port 80) on fresh cache boxes [21:04:56] is jenkins-bot down? https://gerrit.wikimedia.org/r/#/c/309553/ [21:05:25] you have to do some hacky puppet things to sequence events: install package X, shut down the default service, install package Y, then reconfigure both, and have the normal runtime management of service X not get messed with by the shutdown of the default service earlier, etc... [21:06:13] bblack: I'm thinking it'd be easier to make aptly not require nginx - factor nginx out into an optional component so that I can just use my existing apache instead of having two web servers on one machine [21:06:53] seems less hacky and simpler [21:07:42] kaldari: Jenkins / CI is low on capacity :( [21:08:07] kaldari: you can always have an idea of how busy it is via https://integration.wikimedia.org/zuul/ :D [21:09:10] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search (Current work), 13Patch-For-Review: Upgrade elasticsearch and plugins to 2.3.5 - https://phabricator.wikimedia.org/T145404#2663457 (10debt) 05Open>03Resolved [21:12:03] it would be nice if there was a universal/standard way to tell debian "for these packages, even though they're daemons, don't ever install/start/restart/manage any default service, just manage the software itself" [21:12:22] there's probably some complicated way that maybe we could wrap up in puppet somehow [21:13:15] or maybe debian could break up the packaging as a new standard [21:13:52] something with better naming, but basically turn the package "lighttpd" into "lighttpd-software" and "lighttpd-service" which depends on the former [21:14:22] then at least there's the option in some places to only install the -software part and manage anything else on top separately [21:21:41] RECOVERY 
- puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:23:21] bblack: i really dislike that convention where .debs start services on install [21:23:33] i think that should be off by default [21:23:35] you should have to manually start [21:26:51] 07Puppet, 10Beta-Cluster-Infrastructure, 07Beta-Cluster-reproducible, 07Easy: "Connect to 'deployment.eqiad.wmnet' instead" when you ssh into deployment-tin on Beta - https://phabricator.wikimedia.org/T146505#2663483 (10hashar) [21:28:39] 07Puppet, 10Beta-Cluster-Infrastructure, 07Beta-Cluster-reproducible, 07Easy: "Connect to 'deployment.eqiad.wmnet' instead" when you ssh into deployment-tin on Beta - https://phabricator.wikimedia.org/T146505#2663487 (10hashar) [21:42:24] !log ebernhardson@tin Synchronized php-1.28.0-wmf.20/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php: Additional logging to track down autocomplete timing regression (duration: 00m 50s) [21:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:05:00] !log Deployed patch for T146425 [22:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:07:42] PROBLEM - puppet last run on mw1163 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [22:32:52] RECOVERY - puppet last run on mw1163 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:50:12] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search (Current work): Resolve huge perf regression on autocomplete queries - https://phabricator.wikimedia.org/T146465#2663612 (10EBernhardson) p:05Triage>03High [22:51:22] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search (Current work): Resolve huge perf regression on autocomplete queries - https://phabricator.wikimedia.org/T146465#2661894 (10EBernhardson) I can't do much more with this, best i can tell there are a set of overloaded API servers. Adding ops and t... [22:58:12] 06Operations, 10Deployment-Systems: sftp gives bogus "Couldn't stat remote file: No such file or directory" - https://phabricator.wikimedia.org/T146509#2663619 (10Mattflaschen-WMF) [23:13:56] 06Operations: sftp gives bogus "Couldn't stat remote file: No such file or directory" - https://phabricator.wikimedia.org/T146509#2663682 (10greg) [23:44:44] (03PS4) 10Brion VIBBER: static.php should use deployed branch for invalid hashes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312254 (https://phabricator.wikimedia.org/T146363) [23:44:54] (03CR) 10Krinkle: "Added some related documentation." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312254 (https://phabricator.wikimedia.org/T146363) (owner: 10Brion VIBBER) [23:48:47] (03CR) 10Krinkle: [C: 031] "I assume only the change to the array_unshift condition is needed for the main cases. The second change is to make sure a 5-character non-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312254 (https://phabricator.wikimedia.org/T146363) (owner: 10Brion VIBBER) [23:49:41] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues