[00:36:37] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:01:10] RECOVERY - puppet last run on mw1303 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [01:17:49] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:42:36] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:23:05] PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:24:27] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:29:56] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 670 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4013713 keys - replication_delay is 670 [02:34:47] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0] [02:39:37] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [02:47:18] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3989934 keys - replication_delay is 0 [02:47:50] RECOVERY - puppet last run on db1068 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [02:49:09] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [03:39:07] 06Operations, 10Traffic, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2666041 (10Tbayer) Related IRC discussion: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-tech/20160925.txt (seems to have been resolved already: T146569 ). 
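The Redis alert above fires when replication_delay crosses the 600-second threshold (670 was measured on rdb2006). A minimal Python sketch of that kind of probe follows; the host:port, key count and threshold are taken from the alert text, but mapping "replication_delay" to Redis's master_last_io_seconds_ago field is an assumption — the production check is a separate implementation.

```python
#!/usr/bin/env python3
"""Sketch of a Redis replication-delay probe in the spirit of the
'Redis status tcp_6479' alert above; not the production check."""
import sys
import redis  # pip install redis

WARN, CRIT = 300, 600              # seconds; 600 matches the critical value above
HOST, PORT = "10.192.48.44", 6479  # taken from the alert text

def main():
    r = redis.StrictRedis(host=HOST, port=PORT, socket_timeout=5)
    info = r.info()
    # On a replica, master_last_io_seconds_ago says how long since the last
    # interaction with the master; treat that as the replication delay.
    delay = info.get("master_last_io_seconds_ago", 0)
    keys = sum(db.get("keys", 0) for name, db in info.items()
               if isinstance(db, dict) and name.startswith("db"))
    msg = f"REDIS on {HOST}:{PORT} has {keys} keys - replication_delay is {delay}"
    if delay >= CRIT:
        print(f"CRITICAL: {msg}")
        return 2
    if delay >= WARN:
        print(f"WARNING: {msg}")
        return 1
    print(f"OK: {msg}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```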
[03:51:29] 06Operations, 10Traffic, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2666070 (10Peachey88) [06:28:18] (03Draft2) 10MarcoAurelio: DNS configuration for olo.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/312805 (https://phabricator.wikimedia.org/T146612) [06:28:26] (03Draft1) 10MarcoAurelio: DNS configuration for olo.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/312805 (https://phabricator.wikimedia.org/T146612) [06:28:43] (03CR) 10MarcoAurelio: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/312805 (https://phabricator.wikimedia.org/T146612) (owner: 10MarcoAurelio) [06:53:44] (03Draft2) 10MarcoAurelio: Initial configuration for olo.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612) [06:53:50] (03Draft1) 10MarcoAurelio: Initial configuration for olo.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612) [06:58:27] (03Draft2) 10MarcoAurelio: RESTBase configuration for olo.wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/312808 [06:58:31] (03Draft1) 10MarcoAurelio: RESTBase configuration for olo.wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/312808 [07:26:43] (03Draft2) 10MarcoAurelio: Labs configuration for olo.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/312812 (https://phabricator.wikimedia.org/T146612) [07:26:48] (03Draft1) 10MarcoAurelio: Labs configuration for olo.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/312812 (https://phabricator.wikimedia.org/T146612) [07:27:36] (03PS3) 10MarcoAurelio: RESTBase configuration for olo.wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/312808 (https://phabricator.wikimedia.org/T146612) [07:35:41] 06Operations, 10ops-eqiad, 10DBA: db1060: Degraded RAID - https://phabricator.wikimedia.org/T146449#2666237 (10Marostegui) All good now! ``` Device Present ================ Virtual Drives : 1 Degraded : 0 Offline : 0 Physical Devices : 14 Disks... [07:36:57] 06Operations, 10ops-eqiad, 10DBA: db1060: Degraded RAID - https://phabricator.wikimedia.org/T146449#2666238 (10Marostegui) 05Open>03Resolved [07:51:12] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search (Current work): Resolve huge perf regression on autocomplete queries - https://phabricator.wikimedia.org/T146465#2666257 (10dcausse) 05Open>03Resolved a:03EBernhardson Thanks @ema and @EBernhardson ! [07:52:17] (03PS1) 10Urbanecm: [throttle] Rule for Winona State University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312815 (https://phabricator.wikimedia.org/T146600) [08:14:37] (03PS2) 10Urbanecm: [throttle] Rule for Winona State University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312815 (https://phabricator.wikimedia.org/T146600) [08:16:05] (03PS1) 10Urbanecm: [throttle] Rule for Winona State University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312816 (https://phabricator.wikimedia.org/T146600) [08:22:21] morning all. http://wikimediacommons.org/ seems to have stopped working (it's supposed to redirect to commons.wikimedia.org) [08:22:31] do you want a Phab ticket opened ? 
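The db1060 RAID task above was closed on the strength of a MegaCLI-style summary ("Degraded : 0", "Offline : 0"). A rough sketch of parsing that kind of output is below; the megacli path and exact field layout are assumptions, and the production RAID check is a different script.

```python
#!/usr/bin/env python3
"""Sketch: parse a MegaCLI adapter summary like the one pasted in T146449
above and flag degraded/offline virtual drives. Command path is hypothetical."""
import re
import subprocess
import sys

CMD = ["/usr/sbin/megacli", "-AdpAllInfo", "-aALL", "-NoLog"]  # hypothetical path

def counts(text):
    """Pull the 'Degraded' and 'Offline' counters out of the summary block."""
    out = {}
    for field in ("Degraded", "Offline"):
        m = re.search(rf"^\s*{field}\s*:\s*(\d+)", text, re.MULTILINE)
        out[field] = int(m.group(1)) if m else None
    return out

def main():
    text = subprocess.run(CMD, capture_output=True, text=True, check=False).stdout
    c = counts(text)
    if any(v is None for v in c.values()):
        print("UNKNOWN: could not parse controller output")
        return 3
    if c["Degraded"] or c["Offline"]:
        print(f"CRITICAL: {c['Degraded']} degraded, {c['Offline']} offline")
        return 2
    print("OK: no degraded or offline virtual drives")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```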
[08:28:37] NotASpy: yes pelase [08:29:45] looks like this was possibly planned https://phabricator.wikimedia.org/T105981#2233631 [08:32:02] maybe but the entries are still in puppet so that at least could be cleared up on the ticket [08:32:08] modules/mediawiki/files/apache/sites/redirects.conf this file still has them [08:32:20] also https://phabricator.wikimedia.org/T101048 [08:32:42] but neither of them has recent updates over the last few days/weeks [08:33:03] might as well reference them in the new ticket too [08:33:22] then hopefully we can get a clear decision / status [08:35:58] robh: whenyou get here can you change the topic to make me the clinic duty person? [08:36:12] done =] [08:36:16] thank you! [08:42:10] https://phabricator.wikimedia.org/T146619 for your viewing pleasure. [08:46:57] thank you, NotASpy [08:47:00] redirection policy task, eeeee [08:47:05] * robh appends project and otherwise avoids [08:47:32] 06Operations, 10DNS, 10Traffic: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619#2666372 (10ArielGlenn) Note that the redirects are still in puppet: modules/mediawiki/files/apache/sites/redirects.conf [08:47:38] There are strong viewpoints about using all those alternative domain names/urls. [08:48:13] 06Operations, 10DNS, 10Traffic: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619#2666373 (10ArielGlenn) p:05Triage>03Normal [08:48:53] robh, should I add you as a subscriber on this ticket? [08:48:58] hellllll no [08:49:01] hahahaha [08:49:01] ;] [08:49:22] NotASpy: note that there will probably be no movement on it this week, the opsen are at an off site so only dealing with emergencies [08:49:35] I did ask the question if we should just nuke all instances of redirecting domains, quite happy to do that, I have no strong feeling either way. [08:50:21] also the policy is a org wide one, so ops cannot make it in a silo [08:50:34] we do need to decide on a wholesale policy on how to treat redirection by default. [08:51:48] 06Operations, 10DNS, 10Traffic: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619#2666375 (10ArielGlenn) Adding @Krenair and @BBlack from the other tickets. [08:52:16] 06Operations, 10DNS, 10Traffic: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619#2666377 (10Krenair) Looks like it just has no A records in DNS [08:53:13] Krenair, yeah they were removed apparently as part of cleanup [08:53:40] 06Operations, 10DNS, 10Traffic: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619#2666378 (10Krenair) Ah, it was https://gerrit.wikimedia.org/r/#/c/244092/ [08:53:43] but should they have been? then should the redirects go too? or is it premature, I have no idea [08:53:44] so [08:53:57] I punt to you folks :-D [09:09:27] Now all the redirecting domains can just use Let's encrypt, can't they [09:11:04] it's not quite that simple due to the scale of the problem [09:11:20] LE limits certs to 100 domains each [09:13:21] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:14:11] that limit could do us in pretty quick [09:15:01] pretty sure we're already over that limit [09:17:10] (in terms of number of domains we'd want on there) [09:17:35] PROBLEM - puppet last run on ganeti1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:21:49] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 38458 MB (3% inode=99%) [09:31:37] RECOVERY - Disk space on maps-test2001 is OK: DISK OK [09:40:30] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:41:02] 06Operations, 10Traffic, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2666428 (10Krinkle) [09:41:13] 06Operations, 10Traffic, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#1494832 (10Krinkle) [09:42:24] RECOVERY - puppet last run on ganeti1004 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [09:45:20] (03PS1) 10Yuvipanda: k8s: Make the kubernetes user be a member of ssl-certs group [puppet] - 10https://gerrit.wikimedia.org/r/312817 [09:46:25] (03CR) 10jenkins-bot: [V: 04-1] k8s: Make the kubernetes user be a member of ssl-certs group [puppet] - 10https://gerrit.wikimedia.org/r/312817 (owner: 10Yuvipanda) [09:46:25] PROBLEM - Host betelgeuse is DOWN: PING CRITICAL - Packet loss = 100% [09:46:32] PROBLEM - puppet last run on wtp1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:47:35] (03PS2) 10Yuvipanda: k8s: Make the kubernetes user be a member of ssl-certs group [puppet] - 10https://gerrit.wikimedia.org/r/312817 [09:48:10] Jeff_Green: ^ you are workign on betelgeuse right? [09:48:28] codfw frack host. [09:48:38] robh yeah, i rebooted it for a kernel update and apparently it didn't come back within the 10 minute window of the downtime [09:48:40] looking [09:48:59] cool, just wanted to ensure you were aware [09:49:02] thx [09:49:24] aaaand my android phone has no signal because it probably needs a restart [09:51:03] fixed. now I can get pages :-P [09:54:34] \o/ [09:54:44] 'lo apergos [09:58:42] 07Puppet, 10Beta-Cluster-Infrastructure, 07Beta-Cluster-reproducible, 07Easy: "Connect to 'deployment.eqiad.wmnet' instead" when you ssh into deployment-tin on Beta - https://phabricator.wikimedia.org/T146505#2666450 (10hashar) p:05Triage>03Low [10:00:24] RECOVERY - Host betelgeuse is UP: PING OK - Packet loss = 0%, RTA = 37.90 ms [10:04:22] 06Operations, 10DNS, 10Traffic: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619#2666339 (10Bawolff) Stupid question - if its just about cost of certs, can't we use LetsEncrypt? 
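Picking up the Let's Encrypt thread above ("LE limits certs to 100 domains each", "pretty sure we're already over that limit"): a quick sketch for estimating how many SAN certificates the redirect domains would need under that limit. The input file name is hypothetical, and the 100-name figure is the one quoted in the conversation; the exact rate limits are revisited later in T146619.

```python
#!/usr/bin/env python3
"""Sketch: estimate how many Let's Encrypt certificates a pile of redirect
domains would need, given the 100-names-per-certificate limit quoted above.
'redirect-domains.txt' is a hypothetical input, one domain per line."""
import sys

NAMES_PER_CERT = 100  # limit cited in the discussion above

def chunked(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def main(path="redirect-domains.txt"):
    with open(path) as f:
        domains = sorted({line.strip() for line in f if line.strip()})
    batches = list(chunked(domains, NAMES_PER_CERT))
    print(f"{len(domains)} domains -> {len(batches)} certificate(s)")
    for n, batch in enumerate(batches, 1):
        print(f"  cert {n}: {batch[0]} .. {batch[-1]} ({len(batch)} names)")

if __name__ == "__main__":
    main(*sys.argv[1:])
```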
[10:05:37] 06Operations, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Discovery: Puppet sslcert::ca does not refresh the certificate symlinks when a .crt is updated - https://phabricator.wikimedia.org/T145609#2666467 (10hashar) [10:06:08] 06Operations, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Discovery: Puppet sslcert::ca does not refresh the certificate symlinks when a .crt is updated - https://phabricator.wikimedia.org/T145609#2635904 (10hashar) p:05Triage>03Normal [10:06:41] morning Bsad owski [10:06:44] 1 [10:06:45] heh [10:07:01] Bsadowski1: 100 cert limit from LE [10:07:34] I"m going torun around the corner and get breakfast foods (and lunch at the same time), omg it's already 1 pm [10:07:36] brb [10:09:31] (03CR) 10Marostegui: [C: 031] mariadb: fix class dependency on beta [puppet] - 10https://gerrit.wikimedia.org/r/312652 (owner: 10Hashar) [10:11:27] RECOVERY - puppet last run on wtp1015 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [10:13:47] 06Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, 13Patch-For-Review: Redirect yue.wikipedia.org to zh-yue.wikipedia.org - https://phabricator.wikimedia.org/T105999#2666481 (10Liuxinyu970226) [10:20:48] 07Puppet: Investigate usage of hiera_hash in our puppet repo - https://phabricator.wikimedia.org/T146621#2666483 (10yuvipanda) [10:20:51] 06Operations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815#2666500 (10hashar) [10:20:54] 06Operations, 10Beta-Cluster-Infrastructure: Check status of under_NDA group - https://phabricator.wikimedia.org/T142822#2666497 (10hashar) 05Open>03Resolved a:03hashar I have removed the group. Every project members already had root access anyway. [10:22:11] 07Puppet, 06Labs, 10Labs-Infrastructure: Investigate usage of hiera_hash in our puppet repo - https://phabricator.wikimedia.org/T146621#2666501 (10Andrew) [10:22:55] oops forgot to say: back. [10:36:03] 07Puppet, 10Beta-Cluster-Infrastructure, 06Discovery, 10Wikimedia-Portals, 13Patch-For-Review: beta-mediawiki-config-update-eqiad failing with merge conflict in portals - https://phabricator.wikimedia.org/T129427#2666573 (10hashar) 05Open>03Resolved Havent seen this one happening again. I am assuming... [10:36:47] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM, 13Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2666578 (10hashar) [10:37:40] (03PS2) 10Hashar: beta: drop deployment-tin add deployment-tin02 [puppet] - 10https://gerrit.wikimedia.org/r/312654 (https://phabricator.wikimedia.org/T144006) [10:38:26] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM, 13Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2666597 (10hashar) The deployment servers have been reimaged to Jessie: * deployment-mira * deployment-tin02 Last patch to land is https://gerrit.wiki... [10:42:21] 06Operations, 10hardware-requests, 10netops, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2666607 (10hashar) 05Open>03Resolved We had contint1001 allocated. It has a p... [10:48:42] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:50:11] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [10:50:58] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure: Make deployment-prep puppetmaster more similar to Production puppetmaster - https://phabricator.wikimedia.org/T146627#2666629 (10yuvipanda) [10:52:46] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure: Make deployment-prep puppetmaster more similar to Production puppetmaster - https://phabricator.wikimedia.org/T146627#2666647 (10yuvipanda) [10:53:20] 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2666650 (10deryckchan) As someone without developer access, I gather from the discussion so far (over 7 years) that we're solving four different problems of various... [10:55:12] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [10:56:47] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs: Implement role based hiera lookups for labs - https://phabricator.wikimedia.org/T120165#2666657 (10hashar) [10:56:50] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure: Make deployment-prep puppetmaster more similar to Production puppetmaster - https://phabricator.wikimedia.org/T146627#2666656 (10hashar) [10:57:07] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure: Make deployment-prep puppetmaster more similar to Production puppetmaster - https://phabricator.wikimedia.org/T146627#2666629 (10hashar) p:05Triage>03Normal [11:00:12] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:05:12] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:08:38] rats, no Jeff_Green for lutetium [11:10:12] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:13:33] RECOVERY - puppet last run on rdb1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:15:12] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:16:11] sms sent [11:20:12] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:25:14] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:30:06] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:33:06] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:35:08] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:40:08] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:45:12] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:50:10] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:55:07] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:57:44] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:00:14] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [12:05:09] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [12:10:12] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [12:14:35] 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2666887 (10Verdy_p) "3. Change the "traditional" MediaWiki interwiki prefix (not so important because Wikidata has made that mostly obsolete)" That's wrong, we ver... [12:15:11] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [12:18:34] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:18:59] ACKNOWLEDGEMENT - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) arielglenn will be investigated in a little while (msg from Jeff) [12:20:42] 06Operations, 10Datasets-General-or-Unknown: Reboot snapshot servers - https://phabricator.wikimedia.org/T146127#2666894 (10ArielGlenn) snapshot1001,5 done. The other two have jobs running on them that will finish up Tuesday. [12:29:52] (03CR) 10Hashar: "Yup the recursive strategy causes "git rebase" to always rebase even when the end result would be a noop :D" [puppet] - 10https://gerrit.wikimedia.org/r/312748 (https://phabricator.wikimedia.org/T131946) (owner: 10Hashar) [12:40:44] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 661 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3980966 keys - replication_delay is 661 [12:43:30] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:52:45] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:14] ^^ Phabricator cant reach the database [12:54:01] PROBLEM - https://phabricator.wikimedia.org on phab2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:54:07] looking [12:54:09] we're looking [12:54:13] oh good [12:58:39] and it's back :) [12:58:48] RECOVERY - https://phabricator.wikimedia.org on phab2001 is OK: HTTP OK: HTTP/1.1 200 OK - 27146 bytes in 0.227 second response time [13:00:08] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 27146 bytes in 0.217 second response time [13:00:17] magic [13:00:44] phew, TY! 
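The repeated lutetium alerts above all report the same thing: the probe cannot connect through /tmp/mysql.sock (errno 2). The production check_mysql is presumably the stock Nagios plugin; the following is only a minimal Python equivalent of that connectivity test, with the user and credentials file as assumptions.

```python
#!/usr/bin/env python3
"""Sketch of a check_mysql-style probe over the local socket, mirroring the
lutetium alert above; credentials handling here is an assumption."""
import os
import sys
import pymysql  # pip install pymysql

SOCKET = "/tmp/mysql.sock"  # path from the alert text

def main():
    try:
        conn = pymysql.connect(unix_socket=SOCKET, user="nagios",
                               read_default_file=os.path.expanduser("~/.my.cnf"),
                               connect_timeout=10)
    except pymysql.MySQLError as exc:
        print(f"CRITICAL: Cant connect to local MySQL server through socket "
              f"{SOCKET} ({exc.args[0]})")
        return 2
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS LIKE 'Uptime'")
        uptime = cur.fetchone()[1]
        cur.execute("SHOW SLAVE STATUS")   # empty result set if not a replica
        replica = cur.fetchone()
    conn.close()
    print(f"OK: Uptime: {uptime} Slave: {'Yes' if replica else 'No'}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```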
[13:01:18] I was looking but got nowhere near to located the issue before the recovery [13:09:05] apergos: I think jynus has a fair idea about a large delete causing a lock and so unavailability, he's giving #releng a heads up I believe [13:09:22] oh my [13:09:28] good to know, thanks [13:10:13] RECOVERY - check_mysql on lutetium is OK: Uptime: 1332 Threads: 1 Questions: 123396 Slow queries: 15 Opens: 8035 Flush tables: 2 Open tables: 64 Queries per second avg: 92.639 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [13:13:25] PROBLEM - puppet last run on elastic1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:17:16] PROBLEM - puppet last run on elastic1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:22:06] Phabricator seems to have DB issues again for me - "Can Not Connect to MySQL" [13:22:08] Phabricator can't connect to MySQL [13:22:11] jynus_: ^ [13:22:15] same as 20min ago [13:23:09] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:24:01] (03CR) 10Thcipriani: [C: 031] "very nice :)" [puppet] - 10https://gerrit.wikimedia.org/r/312748 (https://phabricator.wikimedia.org/T131946) (owner: 10Hashar) [13:24:26] PROBLEM - https://phabricator.wikimedia.org on phab2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:26:50] RECOVERY - https://phabricator.wikimedia.org on phab2001 is OK: HTTP OK: HTTP/1.1 200 OK - 27147 bytes in 4.656 second response time [13:28:01] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 27146 bytes in 0.201 second response time [13:29:00] maybe jynus you would peek at that a bit more? ^^ since it's flapping a little [13:29:23] hm not in here [13:29:58] 06Operations, 10Traffic: Remove "GeoIP lookup" service from https://status.wikimedia.org - https://phabricator.wikimedia.org/T146638#2666993 (10Aklapper) [13:30:24] apergos: it's dropping in and out based on an aria locking issue, jynus is going ot get into it and we'll regenarate the search indexes if needed [13:30:27] apergos: he's working on it (sitting next to me) [13:30:36] awesome, thank you [13:30:41] 06Operations, 06Performance-Team, 10Thumbor: Temp files not cleaned up on conversion error - https://phabricator.wikimedia.org/T146262#2667101 (10Gilles) [13:30:43] 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2667032 (10deryckchan) @Verdy_p I agree with you. > - adding an interwiki and aliasing the former one, and checking that "#language:" correctly resolves both code... [13:30:51] PROBLEM - puppet last run on db1086 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:38:39] RECOVERY - puppet last run on elastic1032 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [13:42:19] The table is being converted to InnoDB, it is still running [13:43:55] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [13:44:06] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [13:44:56] RECOVERY - puppet last run on elastic1034 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:49:36] 06Operations, 06MediaWiki-Stakeholders-Group, 10Traffic, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2667102 (10Arseny1992) [13:50:05] 06Operations, 10ops-eqiad, 10DBA: db1060: Degraded RAID - https://phabricator.wikimedia.org/T146449#2667177 (10Marostegui) thanks [13:50:53] The table is now converted [13:51:19] Searching works, but it doesn't give results. We are going to regenerate the search index [13:51:25] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 [13:51:36] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 [13:52:18] !log iridium phab ./bin/search index --all [13:52:22] !log phabricator is back in write mode - search is degraded. we are regenerating the indexes [13:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:55:47] RECOVERY - puppet last run on db1086 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [14:01:33] 06Operations: sftp gives bogus "Couldn't stat remote file: No such file or directory" - https://phabricator.wikimedia.org/T146509#2663619 (10ArielGlenn) Is there a file of that name there already? If not, maybe you don't want 'resume'... [14:18:08] 06Operations, 10Monitoring, 06Performance-Team, 06Release-Engineering-Team, 07Wikimedia-Incident: MediaWiki load time regression should trigger an alarm / page people - https://phabricator.wikimedia.org/T146125#2651529 (10ori) #performance-team is considering making this the focus of our off-site. [14:21:42] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2667326 (10Esc3300) Thanks for fixing this. Afterwards, occasionally, I got old data and the "data updated"... [14:24:07] jynus_: marostegui: well done :] [14:24:21] heavy search and MyISAM aren't playing nice are they? [14:34:59] 06Operations, 10DNS, 10Traffic: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619#2666339 (10AlexMonk-WMF) >>! In T146619#2666465, @Bawolff wrote: > Stupid question - if its just about cost of certs, can't we use LetsEncrypt? It's not that stupid. Let's Encrypt... [14:42:51] 06Operations, 10DNS, 10Traffic: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619#2667400 (10ArielGlenn) I understand there is a 100 cert limit for Let's Encrypt. Looking at this: https://letsencrypt.org/docs/rate-limits/ it's not clear to me the exact limits. 
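The Phabricator fix described above was to convert the table behind search from Aria/MyISAM to InnoDB and then rebuild the index with `./bin/search index --all`. The sketch below strings those two steps together; the schema name, credentials file and Phabricator install path are all assumptions — only the reindex command is the one logged above.

```python
#!/usr/bin/env python3
"""Sketch of the remediation described above: convert any MyISAM/Aria tables
in the Phabricator search schema to InnoDB, then rebuild the search index."""
import subprocess
import pymysql  # pip install pymysql

DB = "phabricator_search"        # hypothetical schema name
PHAB = "/srv/phab/phabricator"   # hypothetical install path

def main():
    conn = pymysql.connect(read_default_file="/root/.my.cnf",
                           database="information_schema")
    with conn.cursor() as cur:
        cur.execute(
            "SELECT table_name FROM tables "
            "WHERE table_schema = %s AND engine IN ('MyISAM', 'Aria')", (DB,))
        tables = [row[0] for row in cur.fetchall()]
        for t in tables:
            print(f"Converting {DB}.{t} to InnoDB ...")
            cur.execute(f"ALTER TABLE `{DB}`.`{t}` ENGINE=InnoDB")
    conn.close()
    if tables:
        # Rebuild the search index, as was done on iridium above
        # ("search is degraded" until this finishes).
        subprocess.run([f"{PHAB}/bin/search", "index", "--all"], check=True)

if __name__ == "__main__":
    main()
```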
[14:44:42] (03PS3) 10EBernhardson: Update mwdeploy group sudo rights for jessie [puppet] - 10https://gerrit.wikimedia.org/r/312705 [14:44:49] 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2667404 (10Krenair) >>! In T21986#2666650, @deryckchan wrote: > As someone without developer access, I gather from the discussion so far (over 7 years) that we're s... [14:44:55] (03CR) 10EBernhardson: "seems there isn't any harm in having both, updated." [puppet] - 10https://gerrit.wikimedia.org/r/312705 (owner: 10EBernhardson) [14:48:56] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs: Implement role based hiera lookups for labs - https://phabricator.wikimedia.org/T120165#2667411 (10yuvipanda) I'm no longer convinced we have to do this. https://phabricator.wikimedia.org/T91990 should cover most of the things we want from this. [14:51:06] (03CR) 10Thcipriani: [C: 031] "Looks like everywhere that this change needs to be made." [puppet] - 10https://gerrit.wikimedia.org/r/312654 (https://phabricator.wikimedia.org/T144006) (owner: 10Hashar) [15:17:07] (03CR) 10Thcipriani: [C: 031] "lgtm. Just created a task to remove the hard-coded upstart commands @hashar found in scap: T146656" [puppet] - 10https://gerrit.wikimedia.org/r/312705 (owner: 10EBernhardson) [15:22:59] (03PS4) 10EBernhardson: Update mwdeploy group sudo rights for jessie [puppet] - 10https://gerrit.wikimedia.org/r/312705 (https://phabricator.wikimedia.org/T146656) [15:23:30] (03CR) 10Hashar: [C: 031] "Neat thank you Erik. Tyler filled T146656 to track the removal of the old commands." [puppet] - 10https://gerrit.wikimedia.org/r/312705 (https://phabricator.wikimedia.org/T146656) (owner: 10EBernhardson) [15:25:43] 06Operations, 06Performance-Team, 10Thumbor: djvu failure for very high page number - https://phabricator.wikimedia.org/T145616#2667483 (10Gilles) [15:25:46] 06Operations, 06Performance-Team, 10Thumbor: Extremely noisy ffmpeg errors - https://phabricator.wikimedia.org/T145612#2667484 (10Gilles) [15:25:49] 06Operations, 06Performance-Team, 10Thumbor: djvu failure for very high page number - https://phabricator.wikimedia.org/T145616#2636021 (10Gilles) [15:25:52] 06Operations, 06Performance-Team, 10Thumbor: thumbor ffmpeg pipe deadlock - https://phabricator.wikimedia.org/T145626#2667487 (10Gilles) [15:25:55] 06Operations, 06Performance-Team, 10Thumbor: Extremely noisy ffmpeg errors - https://phabricator.wikimedia.org/T145612#2635959 (10Gilles) [15:25:58] 06Operations, 06Performance-Team, 10Thumbor: Temp files not cleaned up on conversion error - https://phabricator.wikimedia.org/T146262#2667490 (10Gilles) [15:26:00] (03PS3) 10MarcoAurelio: Initial configuration for olo.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612) [15:26:01] 06Operations, 06Performance-Team, 10Thumbor: Make the 100MB+ test files downloaded from their source instead of being in the git repo - https://phabricator.wikimedia.org/T145785#2667488 (10Gilles) [15:26:04] 06Operations, 06Performance-Team, 10Thumbor: Thumbor can't load source files bigger than 100MB - https://phabricator.wikimedia.org/T145768#2667489 (10Gilles) [15:44:30] I"m here but for te next while if someone needs me they should ping, I won't be paying close attention [15:53:20] 06Operations, 10Recommendation-API: Backport python3-sklearn and python3-sklearn-lib from sid - https://phabricator.wikimedia.org/T133362#2667567 (10ori) 
05Open>03declined >>! In T133362#2575978, @yuvipanda wrote: > I also think deb packaging for this is going town a long, unrecoverable rabbit hole, and wo... [15:56:28] (03PS1) 10Urbanecm: [throttle] Rule for Winona State University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312842 (https://phabricator.wikimedia.org/T146600) [16:00:25] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 4 others: Redo /beacon/impression system (formerly Special:RecordImpression) to remove extra round trips on all FR impressions (title was: S:RI should pyroperish) - https://phabricator.wikimedia.org/T45250#2667598 (10N... [16:03:41] 06Operations: create notifications about user accounts that have not been used for a long time - https://phabricator.wikimedia.org/T146657#2667606 (10Dzahn) [16:36:45] 06Operations, 10ORES: Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#2667780 (10GWicke) [16:37:04] 06Operations, 10ORES, 06Services: Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#2667792 (10GWicke) [16:55:27] (03PS1) 10Urbanecm: [throttle] Ada Lovelave Day Edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312852 (https://phabricator.wikimedia.org/T146654) [16:56:04] (03CR) 10jenkins-bot: [V: 04-1] [throttle] Ada Lovelave Day Edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312852 (https://phabricator.wikimedia.org/T146654) (owner: 10Urbanecm) [16:56:22] PROBLEM - puppet last run on mw1274 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:40] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 4 others: Redo /beacon/impression system (formerly Special:RecordImpression) to remove extra round trips on all FR impressions (title was: S:RI should pyroperish) - https://phabricator.wikimedia.org/T45250#2667997 (10a... [17:21:23] RECOVERY - puppet last run on mw1274 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:22:42] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api [17:25:05] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:30:32] citoid. sigh [17:30:34] nothing full [17:35:15] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:37:36] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [17:39:41] "recovered" ok but without my intervention [17:45:12] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api [17:46:07] and there we are again [17:56:53] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:57:33] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [18:01:49] the local tests all pan out fine, I don't see anything newly bad in the logs, etc [18:02:33] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [18:02:38] eyeroll [18:02:40] thanks [18:10:14] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:15:05] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:15:28] groan [18:15:38] and as before local tests return results [18:15:54] how are you running them, and where? [18:16:11] * ori is kibitzing [18:17:33] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [18:17:34] thank please do, because I am getting nowhere [18:17:45] run on scb1001,2 [18:18:00] https://wikitech.wikimedia.org/wiki/Citoid under the 'testing' section [18:21:43] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:22:43] I see entries ticking by on sca1001 in the zotero log so I really dunno [18:23:59] request latency is very uneven [18:24:04] the check that issues this alert is check_wmf_service!http://citoid.svc.codfw.wmnet:1970!15 [18:24:21] check_wmf_service is /usr/bin/service-checker-swagger -t $ARG2$ $HOSTNAME$ $ARG1$ [18:24:35] so the invocation is: /usr/bin/service-checker-swagger -t 15 localhost 'http://citoid.svc.codfw.wmnet:1970' [18:24:41] (assuming you are on scb1001,2) [18:24:57] well I can try the icinga check certainly [18:25:03] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api [18:25:30] if I do: while true; do time /usr/bin/service-checker-swagger -t 15 localhost 'http://citoid.svc.codfw.wmnet:1970' ; sleep 1 ; done [18:25:45] sometimes I get 'All endpoints are healthy' in a few seconds, sometimes it takes over a minute [18:25:49] at least once I got a timeout [18:26:03] codfw? [18:26:18] these should be for eqiad [18:26:27] hah, yes, that's wrong. but it's hitting localhost, so the vhost name shouldn't matter much [18:27:16] my check just told me healthy [18:27:17] meh [18:27:24] run it in a loop [18:27:30] yeah I'm about to run it a few times [18:28:46] for i in `seq 20`... [18:28:55] and a 5 sec sleep, let's see what happens [18:29:11] or is that "you won't believe what happens next" :-P [18:29:16] service-checker-swagger is a python util written by _joe_ apparently: https://github.com/lavagetto/service-checker/blob/master/checker/service.py . i'd live-hack it to print latency for each request it makes. it uses the spec to make multiple reqs [18:29:48] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [18:30:20] two healthies and a timeout so far [18:30:29] !sal [18:30:29] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. 
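ori's suggestion above is to make the checker print latency per request. The same information can be gathered from the outside with a small standalone script that repeatedly times the /api endpoint named in the alert; the URL and the 10-second timeout are taken from the alert text, and 20 runs with a 1-second sleep mirror the loop quoted above.

```python
#!/usr/bin/env python3
"""Sketch of the 'print latency for each request' idea above, done from the
outside rather than by live-hacking servicechecker/swagger.py: repeatedly time
the /api endpoint the LVS check keeps flagging and report each request's
duration."""
import time
import urllib.error
import urllib.request

URL = "http://citoid.svc.eqiad.wmnet:1970/api"
TIMEOUT = 10  # seconds, same as the icinga socket timeout above
RUNS = 20

def main():
    for i in range(1, RUNS + 1):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(URL, timeout=TIMEOUT) as resp:
                status = resp.status
        except urllib.error.URLError as exc:
            status = f"error: {exc.reason}"
        except OSError as exc:  # socket.timeout and friends
            status = f"error: {exc}"
        print(f"{i:2d}  {time.monotonic() - start:6.2f}s  {status}")
        time.sleep(1)

if __name__ == "__main__":
    main()
```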
[18:31:56] it didn't help me here, hashar :-P [18:32:23] ah that was for me [18:32:27] ah ha :-D [18:32:33] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [18:32:38] wanted to check whether someone noticed logrotate failing on fluorine [18:32:55] well not failling [18:33:02] just logrotated later than I would have expected [18:37:03] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:37:35] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:39:17] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [18:39:56] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [18:41:47] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [18:43:57] six timeouts out of twenty [18:44:19] ganglia doesnt' show any big changes in load on sca* or scb* over the last 4 hours [18:47:26] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:47] mind if I live-hack /usr/lib/python2.7/dist-packages/servicechecker/swagger.py to print latency and URL for each request? [18:48:49] apergos: ^ [18:48:51] on scb1001 [18:48:51] hashar: seeing as I'm getting nowhere on the flapping citoid, what log rot were you looking for in particular on fluorine? [18:49:02] ori: go ahead, please make a copy of the orig in the same place [18:49:06] yep [18:49:19] I mean it's certainly well over the 10 sec reply that icinga wants, it seems [18:49:25] .py -> .bak [18:49:30] cool [18:49:52] apergos: /a/mw-logs/api.log which has its first entry at roughly 8:40 [18:49:59] would have expected 6:45 or so [18:50:03] ah [18:50:05] lemme see [18:51:29] the api-feature log has messages starting at 6:45 [18:51:32] apergos: it is not important really [18:51:48] so general log rot went off on time as usual [18:52:07] actually [18:52:14] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [18:52:18] we might want to drop api.log entirely [18:52:23] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:52:24] I am not sure whether it has a purpose anymore [18:53:06] but many many of these logs appear to have starting times of 08:45 or so [18:53:18] as though one restart took a very long time [18:53:33] that log every single requests made to the api [18:53:55] sometimes it takes very long to gzip [18:54:22] maybe we can logrotate it on an hourly basis [18:54:33] and look at stopping those logs [18:55:28] I see: [18:55:35] there are a number of logs in here of course [18:55:41] some take a little while to do the rot/gzip [18:55:43] that time adds up [18:56:18] it's the /api?search=http%3A%2F%2Fexample.com&format=bibtex request [18:56:21] (re: citoid) [18:56:22] the api log is 25 down in the list or so [18:56:44] apergos: dont waste your time on the logrotation :] [18:57:09] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [18:57:11] centralauthrename takes 20 minutes all by itself [18:57:15] so that's your answer [18:57:24] no time wasted, but no fix either :-D [18:57:44] ori: wait what? 
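On the fluorine logrotate question: one way to see why api.log only starts at ~8:40 is to look at when each freshly rotated .gz finished being written — a slow gzip (centralauthrename is quoted above at ~20 minutes) pushes every log rotated after it later. The directory and filename pattern below are assumptions based on the path quoted in the conversation.

```python
#!/usr/bin/env python3
"""Sketch: reconstruct how long the nightly logrotate run took by sorting the
rotated .gz files by mtime and printing each file's offset from the first one."""
import glob
import os
from datetime import datetime

LOGDIR = "/a/mw-logs"  # path as quoted above; filename pattern is a guess

def main():
    rotated = sorted(glob.glob(os.path.join(LOGDIR, "*.log-*.gz")),
                     key=os.path.getmtime)
    if not rotated:
        print("no rotated logs found")
        return
    t0 = os.path.getmtime(rotated[0])
    for path in rotated:
        done = os.path.getmtime(path)
        print(f"+{(done - t0) / 60:6.1f} min  "
              f"{datetime.fromtimestamp(done):%H:%M:%S}  {os.path.basename(path)}")

if __name__ == "__main__":
    main()
```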
[18:58:35] I mean I see that's the one with whining about the timeout [18:58:39] but what is that even [18:58:41] the checker checks multiple endpoints: /, /_info, and /api. It's the latter one [18:58:42] example.com? [18:59:02] oh the search.... [18:59:11] groan [18:59:29] now officially and completely out of my depth :-P [18:59:43] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [19:01:48] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [19:07:04] I would expect the checker to use one of the examples in here: https://github.com/zotero/translators or am I completely off base? [19:14:56] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2668421 (10awight) I'm digging a rabbithole through the cache vaults, and found this interesting resul... [19:18:12] akosiaris, if/when you are around would you mind having a look in on the citoid issue? (see scrollback) [19:20:57] I'm officially off unless there's an emergency site-wise [19:21:04] of course so is the rest of ops :-/ [19:21:15] once a year, that's how it is [19:29:05] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:34:54] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:34:54] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:37:10] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [19:37:10] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [19:40:27] PROBLEM - puppet last run on oxygen is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate-kafkatee.pyconf] [19:45:55] (03PS1) 10Catrope: Follow-up fd8998a4ec9: remove another stray $wmgMFUseCentralAuthToken reference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312867 [19:46:04] jdlrobson: ---^^ [19:48:44] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 38573 MB (3% inode=99%) [19:51:39] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2668518 (10awight) I'm using something closer to the original command for probing the message cache, a... [19:52:44] (03CR) 10Hashar: [C: 031] "Looks fun :] Filippo would Prometheus be able to replace Shinken/Icinga? Alarming on a graphite metric is a typical use case for us (both" [puppet] - 10https://gerrit.wikimedia.org/r/304263 (https://phabricator.wikimedia.org/T141785) (owner: 10Thcipriani) [19:53:55] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [19:58:20] (03PS1) 10Kaldari: Deploying PageAssessments to English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312869 (https://phabricator.wikimedia.org/T146679) [20:05:24] RECOVERY - puppet last run on oxygen is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:07:54] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:13:10] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2668547 (10spatton) @awight our typical practice is to disable a given campaign before we swap new ban... [20:35:21] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:54:44] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:55:09] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 28806 seconds ago, expected 28800 [21:00:09] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 29106 seconds ago, expected 28800 [21:05:09] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 29406 seconds ago, expected 28800 [21:07:55] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2668865 (10awight) @spatton I've changed some things about the backend in order to help diagnose this... [21:10:09] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 29706 seconds ago, expected 28800 [21:15:11] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 30006 seconds ago, expected 28800 [21:16:12] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 38732 MB (3% inode=99%) [21:20:11] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 30306 seconds ago, expected 28800 [21:22:34] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:25:11] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 30606 seconds ago, expected 28800 [21:30:11] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 30906 seconds ago, expected 28800 [21:35:11] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 31206 seconds ago, expected 28800 [21:40:12] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 31506 seconds ago, expected 28800 [21:45:12] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 31806 seconds ago, expected 28800 [21:50:12] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 32106 seconds ago, expected 28800 [21:55:13] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 32406 seconds ago, expected 28800 [22:00:13] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 32706 seconds ago, expected 28800 [22:05:13] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 33007 seconds ago, expected 28800 [22:10:15] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 33306 seconds ago, expected 28800 [22:15:16] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 33606 seconds ago, expected 28800 [22:18:01] (03PS5) 10Hoo man: More error logging/ sanity checks for dumpwikidata [puppet] - 10https://gerrit.wikimedia.org/r/311551 [22:19:09] (03CR) 10Hoo man: [C: 031] "Fixed a few minor things. Manually verified this by dumping testwikidatawiki to /tmp on snapshot1007." 
[puppet] - 10https://gerrit.wikimedia.org/r/311551 (owner: 10Hoo man) [22:20:16] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 33907 seconds ago, expected 28800 [22:24:58] 06Operations, 10Mail, 07LDAP, 13Patch-For-Review: Add yubikey attribute to production ldap - https://phabricator.wikimedia.org/T146102#2669082 (10bbogaert) Hi, I made the change to corp LDAP. I have been able to add the wikimediaPerson objectClass and YubiKeyVPN attribute to myself. Can you check if LDAP... [22:25:16] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 34207 seconds ago, expected 28800 [22:27:11] (03PS1) 10Jon Harald Søby: Adding language name configuration for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312944 (https://phabricator.wikimedia.org/T146707) [22:30:06] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 34506 seconds ago, expected 28800 [22:31:06] (sorry for the ping, sir) [22:35:13] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 34807 seconds ago, expected 28800 [22:36:59] those are annoying [22:40:13] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 35106 seconds ago, expected 28800 [22:45:15] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 35406 seconds ago, expected 28800 [22:50:08] (03PS1) 10Thcipriani: Fix failing keyholder arming check [puppet] - 10https://gerrit.wikimedia.org/r/312947 [22:50:16] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 35707 seconds ago, expected 28800 [22:55:12] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 36006 seconds ago, expected 28800 [23:00:16] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 36306 seconds ago, expected 28800 [23:05:11] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 36606 seconds ago, expected 28800 [23:10:11] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 36906 seconds ago, expected 28800 [23:10:24] we get it [23:11:44] heh [23:12:01] I found this amusing: "Puppet last ran 28806 seconds ago, expected 28800" [23:12:21] :p [23:12:38] so close. [23:15:08] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 37206 seconds ago, expected 28800 [23:18:49] PROBLEM - very high load average likely xfs on ms-be1002 is CRITICAL: CRITICAL - load average: 100.07, 100.55, 99.10 [23:19:00] now that's not nothing ^ [23:20:08] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 37506 seconds ago, expected 28800 [23:21:08] (03CR) 10Mattflaschen: [C: 04-1] "Although flow_computed/flow-computed could be used at runtime, it intentionally is not, for performance reasons." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309186 (owner: 10Dereckson) [23:22:16] (03CR) 10Mattflaschen: "Also, you moved it out of the dblist directory. It should stay there." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/309186 (owner: 10Dereckson) [23:25:10] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 37806 seconds ago, expected 28800 [23:29:42] 06Operations, 06Revision-Scoring-As-A-Service: halfak should get emails when ores.wikimedia.org goes down - https://phabricator.wikimedia.org/T146720#2669336 (10Halfak) [23:30:10] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 38106 seconds ago, expected 28800 [23:30:19] Hey folks, I'm looking for help getting paging (email) from icinga for ores.wikimedia.org. [23:30:20] See https://phabricator.wikimedia.org/T146720 [23:30:34] 06Operations, 10Icinga, 06Revision-Scoring-As-A-Service: halfak should get emails when ores.wikimedia.org goes down - https://phabricator.wikimedia.org/T146720#2669349 (10Halfak) [23:30:39] halfak: offsite this week in Spain, they're probably all off drinking/sleeping [23:30:48] Oh yeah. Thanks. [23:30:50] :) [23:30:59] but, +1 to the request :) [23:31:00] But then again, this is helping them fill the gap of awayness. [23:31:16] Hmm... maybe I can wake up really early tomorrow and call in a favor [23:31:39] halfak: I replied just to Amir re the incident, should have adde dyou, but: a quick/short incident report with follow-ups (like that) would be good to have [23:31:40] I just realized I get pings when ores.wmflabs.org goes down -- but not prod! [23:31:52] greg-g, in progress! [23:32:05] be still my heart [23:32:10] :) [23:33:27] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2669351 (10awight) p:05Unbreak!>03High Reducing the priority because we're not actively losing ban... [23:35:12] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 38406 seconds ago, expected 28800 [23:40:12] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 38706 seconds ago, expected 28800 [23:40:25] (03CR) 10Alex Monk: "more details about the bug in the commit message of I815ae9e5" [puppet] - 10https://gerrit.wikimedia.org/r/312947 (owner: 10Thcipriani) [23:41:32] PROBLEM - very high load average likely xfs on ms-be1002 is CRITICAL: CRITICAL - load average: 102.87, 101.37, 99.52 [23:43:32] (03CR) 10Alex Monk: [C: 031] "maybe clarify the commit message, but the code appears to work" [puppet] - 10https://gerrit.wikimedia.org/r/312947 (owner: 10Thcipriani) [23:45:12] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 39007 seconds ago, expected 28800 [23:50:12] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 39306 seconds ago, expected 28800 [23:55:13] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 39607 seconds ago, expected 28800 [23:57:13] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
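The barium alerts above come from a puppet freshness check with a 28800-second (8-hour) threshold, hence the near miss at 28806 seconds noted a bit earlier in the log. A minimal sketch of that kind of check follows; reading last_run_summary.yaml is the usual way to get the last-run time on an agent, but it is an assumption that the frack check_puppetrun works exactly this way.

```python
#!/usr/bin/env python3
"""Sketch of a check_puppetrun-style freshness check: compare the agent's
last-run timestamp against the 28800-second threshold seen in the barium
alerts above."""
import sys
import time
import yaml  # pip install pyyaml

SUMMARY = "/var/lib/puppet/state/last_run_summary.yaml"
MAX_AGE = 28800  # seconds, the 'expected' value in the alerts

def main():
    with open(SUMMARY) as f:
        summary = yaml.safe_load(f)
    last_run = summary.get("time", {}).get("last_run", 0)
    age = int(time.time() - last_run)
    if age > MAX_AGE:
        print(f"CRITICAL: Puppet last ran {age} seconds ago, expected {MAX_AGE}")
        return 2
    print(f"OK: Puppet last ran {age} seconds ago")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```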