[00:00:59] 6operations, 10Wikimedia-Mailing-lists: import old staff list archives ? - https://phabricator.wikimedia.org/T109395#1562853 (10Dzahn)
[00:01:16] 6operations, 10Wikimedia-Mailing-lists: import old staff list archives ? - https://phabricator.wikimedia.org/T109395#1562855 (10Dzahn) p:5High>3Normal
[00:04:35] thcipriani: you should really do ori a solid and promote that tool out of the bowels of our puppet repo into a proper project too
[00:07:10] it is a really nice elegant solution to that particular problem.
[00:15:50] 6operations, 10Wikimedia-Mailing-lists: write migration plan for mailman - https://phabricator.wikimedia.org/T109467#1562866 (10Dzahn) **migration plan for mailman** objectives: - move away from server sodium (lucid) to server fermium (jessie) to get rid of the last lucid box in all of WMF - upgrade m...
[00:26:42] !log deleting blog.sh and blog_pageviews crontab from stat1003
[00:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:32:34] 6operations, 10Wikimedia-Mailing-lists: write migration plan for mailman - https://phabricator.wikimedia.org/T109467#1562900 (10Dzahn)
[00:34:15] 6operations, 10Wikimedia-Mailing-lists: setup rsyncd on fermium to copy files from sodium - https://phabricator.wikimedia.org/T109921#1562903 (10Dzahn) 3NEW a:3Dzahn
[00:35:50] 6operations, 10Wikimedia-Mailing-lists: setup rsyncd on fermium to copy files from sodium - https://phabricator.wikimedia.org/T109921#1562903 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/231190/ https://gerrit.wikimedia.org/r/#/c/231333/ https://gerrit.wikimedia.org/r/#/c/231394/ https://gerrit.wikimedia.org/r...
[00:35:56] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1562912 (10Dzahn)
[00:35:57] 6operations, 10Wikimedia-Mailing-lists: setup rsyncd on fermium to copy files from sodium - https://phabricator.wikimedia.org/T109921#1562911 (10Dzahn) 5Open>3Resolved
[00:37:30] 6operations, 10Wikimedia-Mailing-lists: write script to import mailing lists from other server - https://phabricator.wikimedia.org/T109922#1562914 (10Dzahn) 3NEW a:3Dzahn
[00:38:07] 6operations, 10Wikimedia-Mailing-lists: write script to import mailing lists from other server - https://phabricator.wikimedia.org/T109922#1562914 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/232287/2/modules/mailman/files/scripts/import_list.sh https://gerrit.wikimedia.org/r/#/c/232287/2/modules/mailman/files...
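
T109921 above sets up an rsync daemon on fermium so the mailman data can be copied off sodium; the actual configuration is in the linked gerrit changes. As a rough sketch only: the module name, paths, ownership, and direction of the copy below are assumptions, not taken from those patches.

```
# On fermium, an /etc/rsyncd.conf module that sodium may write to
# ("mailman", the path, and uid/gid "list" are illustrative guesses):
[mailman]
    path        = /var/lib/mailman
    read only   = no
    uid         = list
    gid         = list
    hosts allow = sodium.wikimedia.org

# Then, from sodium, push the list data across:
rsync -av /var/lib/mailman/lists/ fermium.wikimedia.org::mailman/lists/
```
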
[00:38:13] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1562924 (10Dzahn)
[00:38:15] 6operations, 10Wikimedia-Mailing-lists: write script to import mailing lists from other server - https://phabricator.wikimedia.org/T109922#1562923 (10Dzahn) 5Open>3Resolved
[00:39:55] 6operations, 10Wikimedia-Mailing-lists: add public IP for fermium - DNS and DHCP change for reinstall - https://phabricator.wikimedia.org/T109923#1562925 (10Dzahn) 3NEW a:3Dzahn
[00:40:34] 6operations, 10Wikimedia-Mailing-lists: reinstall fermium with jessie and public IP - https://phabricator.wikimedia.org/T109924#1562935 (10Dzahn) 3NEW a:3Dzahn
[00:41:31] 6operations, 10Wikimedia-Mailing-lists: apply regular lists role on fermium and confirm no issues - https://phabricator.wikimedia.org/T109925#1562942 (10Dzahn) 3NEW a:3Dzahn
[00:48:46] (03PS1) 10BBlack: Fix remaining wikidata login issues: duplicate CA User [puppet] - 10https://gerrit.wikimedia.org/r/233086 (https://phabricator.wikimedia.org/T109038)
[00:50:24] (03CR) 10BBlack: [C: 032] "Code is copypasta from directly above with just s/_Token/_User/, should be safe!" [puppet] - 10https://gerrit.wikimedia.org/r/233086 (https://phabricator.wikimedia.org/T109038) (owner: 10BBlack)
[01:02:08] (03CR) 10Dzahn: [C: 031] cassandra: Mute strict puppet-lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/233073 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt)
[01:05:30] 6operations, 6Services: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#1562997 (10Dzahn) p:5Unbreak!>3High
[01:24:34] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 10.00% of data above the critical threshold [500.0]
[01:26:16] wtf was that?
[01:41:42] no idea
[01:46:05] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[02:05:32] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[02:05:33] PROBLEM - Last backup of the tools filesystem on labstore1002 is CRITICAL - Last run result was exit-code
[02:09:42] RECOVERY - Kafka Broker Replica Max Lag on analytics1021 is OK Less than 1.00% above the threshold [1000000.0]
[02:20:53] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[02:20:55] !log l10nupdate@tin Synchronized php-1.26wmf19/cache/l10n: l10nupdate for 1.26wmf19 (duration: 06m 09s)
[02:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:12] (03PS1) 10Tim Landscheidt: gridengine: Ensure that service gridengine-exec is running [puppet] - 10https://gerrit.wikimedia.org/r/233087 (https://phabricator.wikimedia.org/T109728)
[02:31:52] (03CR) 10Tim Landscheidt: "Tested on Toolsbeta; I scrapped toolsbeta-exec-201 and toolsbeta-exec-01 because they were fubar. It took me a while to understand that "" [puppet] - 10https://gerrit.wikimedia.org/r/233087 (https://phabricator.wikimedia.org/T109728) (owner: 10Tim Landscheidt)
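
At its core, the change under review here ("gridengine: Ensure that service gridengine-exec is running", gerrit 233087) amounts to a service resource like the following. This is a minimal sketch based on the commit title rather than the actual patch, which may carry parameters not shown here:

```
# Keep the grid engine exec daemon running on exec nodes
# (service name taken from the commit title above):
service { 'gridengine-exec':
    ensure => running,
    enable => true,
}
```
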
[02:53:55] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 8.33% of data above the critical threshold [500.0]
[03:02:12] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[03:04:03] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18135 bytes in 1.043 second response time
[03:05:52] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[03:06:04] PROBLEM - Last backup of the others filesystem on labstore1002 is CRITICAL - Last run result was exit-code
[03:19:32] PROBLEM - puppet last run on cp3010 is CRITICAL puppet fail
[03:44:42] RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures
[04:01:33] RECOVERY - Last backup of the maps filesystem on labstore1002 is OK - Last run successful
[05:46:53] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL 12.00% of data above the critical threshold [100000000.0]
[05:48:53] RECOVERY - Outgoing network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0]
[06:26:32] PROBLEM - Disk space on iridium is CRITICAL: DISK CRITICAL - free space: / 285 MB (3% inode=84%)
[06:28:24] RECOVERY - Disk space on iridium is OK: DISK OK
[06:29:45] PROBLEM - puppet last run on mw2095 is CRITICAL puppet fail
[06:31:43] PROBLEM - puppet last run on db2044 is CRITICAL Puppet has 1 failures
[06:31:43] PROBLEM - puppet last run on subra is CRITICAL Puppet has 1 failures
[06:32:32] PROBLEM - puppet last run on db2055 is CRITICAL Puppet has 1 failures
[06:32:33] PROBLEM - puppet last run on mw2207 is CRITICAL Puppet has 2 failures
[06:32:33] PROBLEM - puppet last run on mw2018 is CRITICAL Puppet has 1 failures
[06:32:33] PROBLEM - puppet last run on mw2016 is CRITICAL Puppet has 1 failures
[06:32:33] PROBLEM - puppet last run on mw2023 is CRITICAL Puppet has 1 failures
[06:32:34] PROBLEM - puppet last run on mw1110 is CRITICAL Puppet has 1 failures
[06:32:42] PROBLEM - puppet last run on mw1119 is CRITICAL Puppet has 1 failures
[06:32:44] PROBLEM - puppet last run on cp4010 is CRITICAL Puppet has 2 failures
[06:56:24] RECOVERY - puppet last run on db2044 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:56:32] RECOVERY - puppet last run on subra is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:57:13] RECOVERY - puppet last run on db2055 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:57:14] RECOVERY - puppet last run on mw2207 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:57:22] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:57:22] RECOVERY - puppet last run on mw1110 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:22] RECOVERY - puppet last run on mw2016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:22] RECOVERY - puppet last run on mw2023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:23] RECOVERY - puppet last run on mw1119 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:33] RECOVERY - puppet last run on cp4010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:32] RECOVERY - puppet last run on mw2095 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures
[08:14:13] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL - Socket timeout after 10 seconds
[08:14:51] I'm looking into ^
[08:16:10] ACKNOWLEDGEMENT - https://phabricator.wikimedia.org on iridium is CRITICAL - Socket timeout after 10 seconds 20after4 looking into it
[08:17:54] hmm seems like it's working..
[08:18:03] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 21162 bytes in 0.232 second response time
[08:19:57] twentyafterfour: need any help ?
[08:20:15] still not working for me btw
[08:23:35] twentyafterfour: look into error.log
[08:24:03] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL - Socket timeout after 10 seconds
[08:31:43] apache error log?
[08:31:53] akosiaris: I didn't see anything immediately obvious
[08:31:59] server isn't overloaded
[08:32:19] but apache is set to max 150 workers, I'm not sure if that's enough
[08:32:30] twentyafterfour: As received by the server, this request had a nonzero content length but no POST data.\n\nNormally, this indicates that it exceeds the 'post_max_size' setting in the PHP configuration on the server. Increase the 'post_max_size' setting or reduce the size of the request.\n\nRequest size according to 'Content-Length' was '30', 'post_max_size' is set to '10M'.
[08:32:59] obviously the recommendation to increase post_max_size is wrong
[08:32:59] akosiaris: I see that all the time
[08:33:15] really ? I looked before the incident and there isn't any for like hours
[08:33:18] if the content-length is '30' that's way less than the 10m limit
[08:33:47] akosiaris: well I see those errors regularly when I look in the logs, I never analyzed how frequent they were or anything
[08:34:15] I'm not sure what would cause that either
[08:34:24] I'm gonna try restarting apache
[08:34:29] malformed requests ?
[08:34:29] just because I'm stumped
[08:34:32] grep post_max_size phabricator_error.log.1 | wc -l
[08:34:32] 60
[08:34:32] root@iridium:/var/log/apache2# grep post_max_size phabricator_error.log | wc -l
[08:34:32] 543
[08:34:40] akosiaris: yes malformed requests is what I had assumed it was
[08:34:41] Request: GET http://phabricator.wikimedia.org/maniphest/task/create/, from 10.64.0.106 via cp1069 cp1069 ([10.64.0.106]:80), Varnish XID 1377408704 Error: 503, Service Unavailable at Sat, 22 Aug 2015 08:22:51 GMT
[08:34:42] PROBLEM - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd)
[08:34:49] Nemo_bis: yeah, known
[08:34:55] Is what the user sees, if useful
[08:35:00] Yeah I saw
[08:35:02] twentyafterfour: yeah, I concur. restart apache
[08:35:23] twentyafterfour: did you stop phd or it just died
[08:35:26] ?
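
A side note on the post_max_size warnings being grepped above: the raw counts (60 and 543) do not show whether the warnings cluster around the outage. Bucketing them by hour would; this sketch assumes the standard Apache 2.4 error-log timestamp format visible in the quoted entries:

```
# Timestamps look like [Sat Aug 22 08:35:37.858581 2015]; splitting on the
# brackets and trimming seconds/minutes gives one bucket per hour.
grep post_max_size /var/log/apache2/phabricator_error.log \
  | awk -F'[][]' '{print $2}' \
  | cut -d. -f1 | cut -d: -f1 \
  | uniq -c
```
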
[08:35:43] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 21162 bytes in 0.226 second response time
[08:35:44] stopped it
[08:35:59] !log restarted apache2 on iridium
[08:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:36:16] !log restarted phd on iridium
[08:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:36:43] RECOVERY - PHD should be supervising processes on iridium is OK: PROCS OK: 8 processes with UID = 997 (phd)
[08:36:44] apache was using the max 150 processes, I think we need to increase the limit in mpm_worker.conf
[08:37:16] now it's using 70
[08:37:36] twentyafterfour: [Sat Aug 22 08:35:37.858581 2015] [core:notice] [pid 46392] AH00052: child pid 5083 exit signal Segmentation fault (11)
[08:37:46] hmmm
[08:37:58] * paravoid lurks
[08:38:02] ok, that would explain the problem. now what's causing the segfault
[08:38:07] lemme know if you need any help
[08:38:22] was that an apache worker segfaulting?
[08:39:01] * twentyafterfour hasn't seen an apache segfault in a few years. those used to be interesting to debug
[08:39:16] should be. logged in /var/log/apache2/error.log so I suppose master logged the child segfaulting ?
[08:39:34] also
[08:39:37] [34555253.446779] do_IRQ: 6.112 No irq handler for vector (irq -1)
[08:39:39] in dmesg
[08:39:43] this is weird
[08:39:45] hmm
[08:39:46] first time I see this
[08:40:01] that's fine, ignore it
[08:40:28] the apache segfault is most likely actually a php segfault, right?
[08:40:29] yeah, I just looked at the timestamps
[08:40:38] the do_IRQ thing is unrelated
[08:40:53] twentyafterfour: that would be my guess right now
[08:41:32] the output of iostat looks a little weird to me
[08:43:22] md0 and md2 should be seeing more reads, I would think... and what is md1, seems to be unused but it's allocated
[08:44:48] syslog has messages from odd dates interspersed with current messages
[08:45:55] Aug 2 05:05:01 iridium CRON[31032]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
[08:45:57] Aug 22 08:41:00 iridium kernel: [34563405.732872] ERST: NVRAM ERST Log Address Range not implemented yet.
[08:50:01] this is weird.. the restart obviously fixed the issue whatever it was
[08:50:11] and we never got a corefile to gdb it
[08:58:08] Yeah agreed it's weird and a little uncomfortable
[09:08:41] I wonder why there was no core dump. Is core dumping disabled? /me forgets what controls that
[09:09:51] hmm, needs to be configured with CoreDumpDirectory directive
[09:15:54] There is also a strange log entry, right before all the segfaults
[09:16:12] [Sat Aug 22 06:28:27.664577 2015] [mpm_prefork:notice] [pid 46392] AH00163: Apache/2.4.7 (Ubuntu) PHP/5.5.9-1ubuntu4.11 configured -- resuming normal operations
[09:16:14] [Sat Aug 22 06:28:27.664653 2015] [core:notice] [pid 46392] AH00094: Command line: '/usr/sbin/apache2'
[09:16:16] [Sat Aug 22 08:35:37.791160 2015] [core:notice] [pid 46392] AH00052: child pid 32242 exit signal Segmentation fault (11)
[09:16:25] I didn't restart it at 6:28
[09:17:57] well, it seems to be stable now and I can't see any clue about what went wrong. I'm going back to bed until it pages me again ;)
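
On the missing corefile: as noted at 09:09:51, Apache only dumps core if CoreDumpDirectory points somewhere writable and the process and kernel limits allow it. A sketch of the full set of knobs, with a placeholder dump directory; none of this reflects what was actually configured on iridium:

```
# Writable scratch directory for the dumps (placeholder path):
install -d -o www-data /var/tmp/apache2-coredumps

# In apache2.conf (the real httpd directive mentioned at 09:09:51):
#   CoreDumpDirectory /var/tmp/apache2-coredumps

# In /etc/apache2/envvars, so the children inherit a nonzero core limit:
#   ulimit -c unlimited

# Let the kernel write cores there with identifiable names:
sysctl -w kernel.core_pattern=/var/tmp/apache2-coredumps/core.%e.%p

service apache2 restart
```
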
[09:18:29] !log phabricator seems stable now, restarting apache2 on iridium did the trick, unfortunately we didn't learn why
[09:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:33:54] mysql says 1000 "aborted clients" -> that means closed by the app, but that is expected for a restart
[09:34:20] no alarm sent by the db as in previous instances
[09:35:51] unrelated, there is lag on some secondary mysqls due to snapshots ongoing
[09:41:05] twentyafterfour: logrotate is the 6:28 UTC restart. it's expected
[09:41:22] but seems to me like apache on iridium needs some love
[09:43:00] like if this happens again, make sure we get core dumps
[10:11:44] (03PS1) 10Mjbmr: Re-enable Flow for fawikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233100 (https://phabricator.wikimedia.org/T109816)
[10:13:14] akosiaris: we need to raise the 150 process limit and configure core dumps in the apache conf. I'll make a task
[10:27:20] 6operations, 6Phabricator: apache on iridium "needs some love" - https://phabricator.wikimedia.org/T109941#1563287 (10mmodell) 3NEW a:3mmodell
[10:42:30] I do not think apache needs some love as much as phabricator needs it
[10:43:34] we are feeding him 1500 MySQL connections and now >150 processes, soon it will grow too large :-)
[10:43:58] like a monster that we cannot control
[10:45:11] from my side, it would be nice if we could do a read/write split - the mysql slave and the proxy are ready for that
[11:11:30] can I get an ok for a global rename for a user with 25k+ edits?
[11:14:12] jynus: ^^
[11:14:51] mafk, how many?
[11:15:10] jynus: hi, 25,745 edits and 183 accounts to be exact
[11:15:50] mafk, one sec, let me check something
[11:16:08] jynus: sure :) Thanks
[11:16:55] mafk: it's always who you know to get responses and things done :)
[11:17:16] :)
[11:17:25] jynus: I didn't recognize you, by the way :D
[11:17:40] do I know you?
[11:17:53] well, we're both from eswiki
[11:17:56] oh, marcoaurelio
[11:18:05] dferg in the past :)
[11:18:13] didn't connect the IRC nick
[11:18:46] It's like mar.k logged in with a misspelled nick ;)
[11:19:19] I am asking you to wait one sec, because we had some nasty lag recently due to some bots
[11:19:30] I want to check that it is fully gone
[11:19:52] I'm not in a hurry
[11:20:58] so, the processes are gone, but the lag may be here for around 38 minutes, that would be my only issue
[11:21:17] especially if you are not in a hurry
[11:21:41] is that ok?
[11:22:04] sure, I can wait
[11:22:06] mafk,
[11:23:27] bug is https://phabricator.wikimedia.org/T109943 if someone is curious
[11:33:27] gone for lunch
[11:34:36] enjoy your lunch :)
[12:10:00] 6operations, 6Phabricator: apache on iridium "needs some love" - https://phabricator.wikimedia.org/T109941#1563396 (10chasemp) It seems the adding of our mass of repos has really changed the load. We could look at some of the preamble client throttling as well as bots have increased.
[12:45:33] PROBLEM - puppet last run on ms-be1018 is CRITICAL puppet fail
[12:58:08] ori or anyone who knows about HHVM: T109929 could use a look, it seems like it might be some sort of HHVM code-cache corruption or something like that, if that even makes sense. I don't know that it won't fix itself before Monday.
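
The pre-rename check jynus walks through around 11:20 comes down to watching replication lag on the affected secondaries until it drains. In MySQL terms it is roughly this, with a placeholder host since the log does not name the secondaries:

```
# Seconds_Behind_Master should come back to 0 before the heavy rename runs.
mysql -h db-secondary.example -e 'SHOW SLAVE STATUS\G' | grep Seconds_Behind_Master
```
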
[13:12:33] RECOVERY - puppet last run on ms-be1018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:39:52] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL - Socket timeout after 10 seconds
[14:40:22] wtf again?
[14:42:37] segmentation fault again
[14:43:14] (Phab gives me 503s. I guess that's known.)
[14:43:18] mm
[14:43:42] zend_mm_heap corrupted
[14:43:42] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 21162 bytes in 0.147 second response time
[14:43:44] [Sat Aug 22 13:15:42.134495 2015] [core:notice] [pid 7700] AH00052: child pid 42285 exit signal Segmentation fault (11)
[14:43:46] [Sat Aug 22 14:33:42.943854 2015] [mpm_prefork:error] [pid 7700] AH00161: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting
[14:44:21] when was it updated last time?
[14:44:35] !log restarted apache2 on iridium. Segfault again. This time I at least got one clue in the log: "zend_mm_heap corrupted"
[14:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:44:44] I remember quite some weeks ago, didn't it?
[14:45:01] was what updated?
[14:45:05] phab
[14:46:39] what I am trying to say is that there is no connection
[14:47:44] between a phab update and this outage? no connection, no... phab update hasn't happened recently
[14:49:25] 6operations, 6Phabricator: apache on iridium "needs some love" - https://phabricator.wikimedia.org/T109941#1563567 (10mmodell) So it happened again, and this is what I found in the log: ``` zend_mm_heap corrupted [Sat Aug 22 13:15:42.134495 2015] [core:notice] [pid 7700] AH00052: child pid 42285 exit signal S...
[14:54:26] 6operations, 6Phabricator: apache on iridium "needs some love" (triggers Phabricator 503s) - https://phabricator.wikimedia.org/T109941#1563586 (10Aklapper)
[15:11:22] 6operations, 6Phabricator: apache on iridium "needs some love" (triggers Phabricator 503s) - https://phabricator.wikimedia.org/T109941#1563592 (10mmodell) There are some suggestions [[ http://stackoverflow.com/questions/2247977/what-does-zend-mm-heap-corrupted-mean | on stack overflow ]] Among them: > After...
[15:14:07] 6operations, 6Phabricator: apache on iridium segfaults (so far this has triggered two phabricator outages in 6 hours) - https://phabricator.wikimedia.org/T109941#1563593 (10mmodell)
[15:26:56] 6operations, 6Phabricator: apache on iridium segfaults (so far this has triggered two phabricator outages in 6 hours) - https://phabricator.wikimedia.org/T109941#1563599 (10greg) p:5Normal>3High
[15:59:04] (03CR) 10Zfilipin: "Should RuboCop ignore everything in modules folder?" [puppet] - 10https://gerrit.wikimedia.org/r/226898 (owner: 10Faidon Liambotis)
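
The AH00161 entry at 14:43:46 is the 150-worker ceiling hit during this second outage. Under mpm_prefork, raising it means editing mpm_prefork.conf, and ServerLimit has to rise along with MaxRequestWorkers once you go past 256. The values below are illustrative only, not what was later applied to iridium:

```
# /etc/apache2/mods-available/mpm_prefork.conf (illustrative values):
<IfModule mpm_prefork_module>
    StartServers            10
    MinSpareServers         10
    MaxSpareServers         20
    ServerLimit             300
    MaxRequestWorkers       300
    MaxConnectionsPerChild  1000
</IfModule>
```

A nonzero MaxConnectionsPerChild also recycles children periodically, which can limit the blast radius of the kind of heap corruption seen here.
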
[16:23:54] PROBLEM - puppet last run on erbium is CRITICAL Puppet has 1 failures
[16:32:59] !log raising values in mpm_worker.conf for iridium to debug and hopefully head off further crashing
[16:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:45:30] !log scratch that as we have mpm_prefork enabled :)
[16:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:49:23] RECOVERY - puppet last run on erbium is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures
[16:54:27] 6operations, 10Adminbot: Upload new release of adminbot for Trusty - https://phabricator.wikimedia.org/T109947#1563644 (10scfc) 3NEW
[16:59:53] <_joe_> chasemp: we can't use worker there as we're still on mod_php
[17:00:09] :) I'm with it now
[17:00:24] <_joe_> yeah sorry, just showed up now :)
[17:25:41] 6operations, 6Phabricator: apache on iridium segfaults (so far this has triggered two phabricator outages in 6 hours) - https://phabricator.wikimedia.org/T109941#1563664 (10chasemp) > >> `export USE_ZEND_ALLOC=0` > I am not sure about this setting. I have seen it used for debugging but I don't know what t...
[17:28:15] !log tweaking apache on iridium T109941
[17:28:18] 6operations, 6Phabricator: apache on iridium segfaults (so far this has triggered two phabricator outages in 6 hours) - https://phabricator.wikimedia.org/T109941#1563672 (10chasemp) I'm making notes to make this persistent on monday.
[17:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:44:15] 6operations, 5Patch-For-Review: Install fonts-wqy-zenhei on all mediawiki app servers - https://phabricator.wikimedia.org/T84777#1563690 (10Krenair)
[17:53:18] (03CR) 10Tim Landscheidt: "After submitting the patch, I thought about whether to subscribe sge_execd to changes of host_aliases would make sense to have those chang" [puppet] - 10https://gerrit.wikimedia.org/r/233087 (https://phabricator.wikimedia.org/T109728) (owner: 10Tim Landscheidt)
[18:08:54] 6operations, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-Requests, 5Patch-For-Review, and 2 others: Configure mediawiki to operate in the Dallas DC - https://phabricator.wikimedia.org/T91754#1563705 (10Krenair) What's missing here still, @Joe?
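
On the `export USE_ZEND_ALLOC=0` suggestion quoted from T109941 at 17:25:41: setting it makes PHP bypass the Zend memory manager in favour of plain system malloc, which is slower but lets glibc or valgrind catch the corruption instead of it surfacing later as "zend_mm_heap corrupted". With mod_php the variable has to reach Apache's environment; a debugging sketch, not something the log says was applied:

```
# In /etc/apache2/envvars, which apache2ctl sources on Ubuntu,
# so the mod_php children inherit it (debugging aid only, noticeably slower):
export USE_ZEND_ALLOC=0
# then: service apache2 restart
```
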
[18:09:16] 6operations, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-Requests, 5codfw-appserver-setup, 5wikis-in-codfw: Configure mediawiki to operate in the Dallas DC - https://phabricator.wikimedia.org/T91754#1563707 (10Krenair)
[18:13:28] (03PS1) 10Southparkfan: Fix minor spelling mistake [puppet] - 10https://gerrit.wikimedia.org/r/233118
[18:23:44] (03CR) 10Merlijn van Deen: [C: 031] gridengine: Ensure that service gridengine-exec is running [puppet] - 10https://gerrit.wikimedia.org/r/233087 (https://phabricator.wikimedia.org/T109728) (owner: 10Tim Landscheidt)
[19:31:36] I'm taking care of the labstore alerts btw
[19:37:52] PROBLEM - puppet last run on mw1186 is CRITICAL Puppet has 1 failures
[19:41:38] !log manually remove old snapshots from labstore1002
[19:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:03:33] RECOVERY - puppet last run on mw1186 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures
[20:13:18] 7Blocked-on-Operations, 6Labs: labstore1002 out of space in vg to create new snapshots - https://phabricator.wikimedia.org/T109954#1563836 (10yuvipanda)
[20:13:37] 6operations, 6Labs: labstore1002 out of space in vg to create new snapshots - https://phabricator.wikimedia.org/T109954#1563855 (10yuvipanda)
[20:28:37] 6operations, 6Labs: labstore1002 out of space in vg to create new snapshots - https://phabricator.wikimedia.org/T109954#1563887 (10yuvipanda) So this also means backups have been broken for about a week.
[20:29:06] 6operations, 6Labs: labstore1002 out of space in vg to create new snapshots - https://phabricator.wikimedia.org/T109954#1563888 (10yuvipanda) p:5Triage>3High
[21:08:51] (03CR) 10Yuvipanda: "People use this in labs, should've been in a production realm branch..." [puppet] - 10https://gerrit.wikimedia.org/r/231487 (owner: 10Ori.livneh)
[21:09:23] (03CR) 10Yuvipanda: "(or in your bash profile :P)" [puppet] - 10https://gerrit.wikimedia.org/r/231487 (owner: 10Ori.livneh)
[21:09:36] ori: I made some comments on https://gerrit.wikimedia.org/r/#/c/231487/
[21:11:50] (03PS2) 10Yuvipanda: Tools: Remove obsolete entries from host_aliases [puppet] - 10https://gerrit.wikimedia.org/r/232968 (https://phabricator.wikimedia.org/T109871) (owner: 10Tim Landscheidt)
[21:12:00] (03CR) 10Yuvipanda: [C: 032 V: 032] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/232968 (https://phabricator.wikimedia.org/T109871) (owner: 10Tim Landscheidt)
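
The labstore1002 cleanup logged at 19:41:38 is plain LVM housekeeping: per T109954, the volume group has no free space left to allocate new snapshots, so old ones have to go. The general shape of it, with placeholder names since the log does not give the vg/lv layout:

```
vgs                                    # free space remaining in the volume group
lvs -o lv_name,lv_attr,lv_size,origin  # snapshot LVs carry an 's' attribute and an origin
lvremove /dev/<vg>/<old-snapshot>      # placeholders; removing one returns its space to the vg
```
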
[21:12:30] (03PS2) 10Yuvipanda: gridengine: Ensure that service gridengine-exec is running [puppet] - 10https://gerrit.wikimedia.org/r/233087 (https://phabricator.wikimedia.org/T109728) (owner: 10Tim Landscheidt)
[21:12:38] (03CR) 10Yuvipanda: [C: 032 V: 032] gridengine: Ensure that service gridengine-exec is running [puppet] - 10https://gerrit.wikimedia.org/r/233087 (https://phabricator.wikimedia.org/T109728) (owner: 10Tim Landscheidt)
[21:13:26] (03PS2) 10Yuvipanda: Tools: Only execute cdnjs-packages-gen on changes [puppet] - 10https://gerrit.wikimedia.org/r/232949 (owner: 10Tim Landscheidt)
[21:13:34] (03CR) 10Yuvipanda: [C: 032 V: 032] Tools: Only execute cdnjs-packages-gen on changes [puppet] - 10https://gerrit.wikimedia.org/r/232949 (owner: 10Tim Landscheidt)
[21:13:39] (03PS1) 10Alex Monk: Revert "base: ensure => absent on 'command-not-found'" [puppet] - 10https://gerrit.wikimedia.org/r/233156
[22:26:01] 7Puppet, 6Labs, 5Patch-For-Review: Could not find data item labs_recursor - https://phabricator.wikimedia.org/T107205#1564451 (10scfc) 5Open>3Resolved a:3scfc The linked patch should have fixed this bug for new self-hosted puppetmasters; on your existing instance (worst case) this should require `sudo...
[22:57:35] 7Puppet, 6Labs, 3Labs-Sprint-104, 3Labs-Sprint-105: Allow per-host hiera overrides via wikitech - https://phabricator.wikimedia.org/T104202#1564500 (10scfc) a:3scfc
[23:08:29] !log krenair@tin Synchronized php-1.26wmf19/extensions/AbuseFilter/maintenance/addMissingLoggingEntries.php: (no message) (duration: 01m 05s)
[23:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:23:53] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0]
[23:35:52] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
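
The "Tools: Only execute cdnjs-packages-gen on changes" change merged at 21:13:34 is the usual Puppet pattern for that: mark the exec refreshonly so it runs only when a resource it subscribes to changes, rather than on every agent run. A generic sketch; the command path and the subscribed resource are hypothetical, not taken from the actual patch:

```
# Hypothetical resource names for illustration:
exec { 'cdnjs-packages-gen':
    command     => '/usr/local/bin/cdnjs-packages-gen',
    refreshonly => true,
    subscribe   => File['/srv/cdnjs/packages.list'],
}
```
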