[00:00:59] 6operations, 10Wikimedia-Mailing-lists: import old staff list archives ? - https://phabricator.wikimedia.org/T109395#1562853 (10Dzahn)
[00:01:16] 6operations, 10Wikimedia-Mailing-lists: import old staff list archives ? - https://phabricator.wikimedia.org/T109395#1562855 (10Dzahn) p:5High>3Normal
[00:04:35] thcipriani: you should really do ori a solid and promote that tool out of the bowels of our puppet repo into a proper project too
[00:07:10] it is a really nice elegant solution to that particular problem.
[00:15:50] 6operations, 10Wikimedia-Mailing-lists: write migration plan for mailman - https://phabricator.wikimedia.org/T109467#1562866 (10Dzahn) **migration plan for mailman** objectives: - move away from server sodium (lucid) to server fermium (jessie) to get rid of the last lucid box in all of WMF - upgrade m...
[00:26:42] !log deleting blog.sh and blog_pageviews crontab from stat1003
[00:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:32:34] 6operations, 10Wikimedia-Mailing-lists: write migration plan for mailman - https://phabricator.wikimedia.org/T109467#1562900 (10Dzahn)
[00:34:15] 6operations, 10Wikimedia-Mailing-lists: setup rsyncd on fermium to copy files from sodium - https://phabricator.wikimedia.org/T109921#1562903 (10Dzahn) 3NEW a:3Dzahn
[00:35:50] 6operations, 10Wikimedia-Mailing-lists: setup rsyncd on fermium to copy files from sodium - https://phabricator.wikimedia.org/T109921#1562903 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/231190/ https://gerrit.wikimedia.org/r/#/c/231333/ https://gerrit.wikimedia.org/r/#/c/231394/ https://gerrit.wikimedia.org/r...
[00:35:56] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1562912 (10Dzahn)
[00:35:57] 6operations, 10Wikimedia-Mailing-lists: setup rsyncd on fermium to copy files from sodium - https://phabricator.wikimedia.org/T109921#1562911 (10Dzahn) 5Open>3Resolved
[00:37:30] 6operations, 10Wikimedia-Mailing-lists: write script to import mailing lists from other server - https://phabricator.wikimedia.org/T109922#1562914 (10Dzahn) 3NEW a:3Dzahn
[00:38:07] 6operations, 10Wikimedia-Mailing-lists: write script to import mailing lists from other server - https://phabricator.wikimedia.org/T109922#1562914 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/232287/2/modules/mailman/files/scripts/import_list.sh https://gerrit.wikimedia.org/r/#/c/232287/2/modules/mailman/files...
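
T109921 above sets up an rsync daemon on fermium so the mailman data can be copied off sodium; the actual configuration is in the linked gerrit changes. As a rough sketch only: the module name, paths, ownership, and direction of the copy below are assumptions, not taken from those patches.

```
# On fermium, an /etc/rsyncd.conf module that sodium may write to
# ("mailman", the path, and uid/gid "list" are illustrative guesses):
[mailman]
    path        = /var/lib/mailman
    read only   = no
    uid         = list
    gid         = list
    hosts allow = sodium.wikimedia.org

# Then, from sodium, push the list data across:
rsync -av /var/lib/mailman/lists/ fermium.wikimedia.org::mailman/lists/
```
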
[00:38:13] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1562924 (10Dzahn)
[00:38:15] 6operations, 10Wikimedia-Mailing-lists: write script to import mailing lists from other server - https://phabricator.wikimedia.org/T109922#1562923 (10Dzahn) 5Open>3Resolved
[00:39:55] 6operations, 10Wikimedia-Mailing-lists: add public IP for fermium - DNS and DHCP change for reinstall - https://phabricator.wikimedia.org/T109923#1562925 (10Dzahn) 3NEW a:3Dzahn
[00:40:34] 6operations, 10Wikimedia-Mailing-lists: reinstall fermium with jessie and public IP - https://phabricator.wikimedia.org/T109924#1562935 (10Dzahn) 3NEW a:3Dzahn
[00:41:31] 6operations, 10Wikimedia-Mailing-lists: apply regular lists role on fermium and confirm no issues - https://phabricator.wikimedia.org/T109925#1562942 (10Dzahn) 3NEW a:3Dzahn
[00:48:46] (03PS1) 10BBlack: Fix remaining wikidata login issues: duplicate CA User [puppet] - 10https://gerrit.wikimedia.org/r/233086 (https://phabricator.wikimedia.org/T109038)
[00:50:24] (03CR) 10BBlack: [C: 032] "Code is copypasta from directly above with just s/_Token/_User/, should be safe!" [puppet] - 10https://gerrit.wikimedia.org/r/233086 (https://phabricator.wikimedia.org/T109038) (owner: 10BBlack)
[01:02:08] (03CR) 10Dzahn: [C: 031] cassandra: Mute strict puppet-lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/233073 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt)
[01:05:30] 6operations, 6Services: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#1562997 (10Dzahn) p:5Unbreak!>3High
[01:24:34] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 10.00% of data above the critical threshold [500.0]
[01:26:16] wtf was that?
[01:41:42] no idea
[01:46:05] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[02:05:32] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[02:05:33] PROBLEM - Last backup of the tools filesystem on labstore1002 is CRITICAL - Last run result was exit-code
[02:09:42] RECOVERY - Kafka Broker Replica Max Lag on analytics1021 is OK Less than 1.00% above the threshold [1000000.0]
[02:20:53] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[02:20:55] !log l10nupdate@tin Synchronized php-1.26wmf19/cache/l10n: l10nupdate for 1.26wmf19 (duration: 06m 09s)
[02:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:12] (03PS1) 10Tim Landscheidt: gridengine: Ensure that service gridengine-exec is running [puppet] - 10https://gerrit.wikimedia.org/r/233087 (https://phabricator.wikimedia.org/T109728)
[02:31:52] (03CR) 10Tim Landscheidt: "Tested on Toolsbeta; I scrapped toolsbeta-exec-201 and toolsbeta-exec-01 because they were fubar. It took me a while to understand that "" [puppet] - 10https://gerrit.wikimedia.org/r/233087 (https://phabricator.wikimedia.org/T109728) (owner: 10Tim Landscheidt)
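
At its core, the change under review here ("gridengine: Ensure that service gridengine-exec is running", gerrit 233087) amounts to a service resource like the following. This is a minimal sketch based on the commit title rather than the actual patch, which may carry parameters not shown here:

```
# Keep the grid engine exec daemon running on exec nodes
# (service name taken from the commit title above):
service { 'gridengine-exec':
    ensure => running,
    enable => true,
}
```
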
[02:53:55] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 8.33% of data above the critical threshold [500.0]
[03:02:12] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[03:04:03] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18135 bytes in 1.043 second response time
[03:05:52] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[03:06:04] PROBLEM - Last backup of the others filesystem on labstore1002 is CRITICAL - Last run result was exit-code
[03:19:32] PROBLEM - puppet last run on cp3010 is CRITICAL puppet fail
[03:44:42] RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures
[04:01:33] RECOVERY - Last backup of the maps filesystem on labstore1002 is OK - Last run successful
[05:46:53] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL 12.00% of data above the critical threshold [100000000.0]
[05:48:53] RECOVERY - Outgoing network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0]
[06:26:32] PROBLEM - Disk space on iridium is CRITICAL: DISK CRITICAL - free space: / 285 MB (3% inode=84%)
[06:28:24] RECOVERY - Disk space on iridium is OK: DISK OK
[06:29:45] PROBLEM - puppet last run on mw2095 is CRITICAL puppet fail
[06:31:43] PROBLEM - puppet last run on db2044 is CRITICAL Puppet has 1 failures
[06:31:43] PROBLEM - puppet last run on subra is CRITICAL Puppet has 1 failures
[06:32:32] PROBLEM - puppet last run on db2055 is CRITICAL Puppet has 1 failures
[06:32:33] PROBLEM - puppet last run on mw2207 is CRITICAL Puppet has 2 failures
[06:32:33] PROBLEM - puppet last run on mw2018 is CRITICAL Puppet has 1 failures
[06:32:33] PROBLEM - puppet last run on mw2016 is CRITICAL Puppet has 1 failures
[06:32:33] PROBLEM - puppet last run on mw2023 is CRITICAL Puppet has 1 failures
[06:32:34] PROBLEM - puppet last run on mw1110 is CRITICAL Puppet has 1 failures
[06:32:42] PROBLEM - puppet last run on mw1119 is CRITICAL Puppet has 1 failures
[06:32:44] PROBLEM - puppet last run on cp4010 is CRITICAL Puppet has 2 failures
[06:56:24] RECOVERY - puppet last run on db2044 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:56:32] RECOVERY - puppet last run on subra is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:57:13] RECOVERY - puppet last run on db2055 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:57:14] RECOVERY - puppet last run on mw2207 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:57:22] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:57:22] RECOVERY - puppet last run on mw1110 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:22] RECOVERY - puppet last run on mw2016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:22] RECOVERY - puppet last run on mw2023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:23] RECOVERY - puppet last run on mw1119 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:33] RECOVERY - puppet last run on cp4010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:32] RECOVERY - puppet last run on mw2095 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures
[08:14:13] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL - Socket timeout after 10 seconds
[08:14:51] I'm looking into ^
[08:16:10] ACKNOWLEDGEMENT - https://phabricator.wikimedia.org on iridium is CRITICAL - Socket timeout after 10 seconds 20after4 looking into it
[08:17:54] hmm seems like it's working..
[08:18:03] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 21162 bytes in 0.232 second response time
[08:19:57] twentyafterfour: need any help ?
[08:20:15] still not working for me btw
[08:23:35] twentyafterfour: look into error.log
[08:24:03] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL - Socket timeout after 10 seconds
[08:31:43] apache error log?
[08:31:53] akosiaris: I didn't see anything immediately obvious
[08:31:59] server isn't overloaded
[08:32:19] but apache is set to max 150 workers, I'm not sure if that's enough
[08:32:30] twentyafterfour: As received by the server, this request had a nonzero content length but no POST data.\n\nNormally, this indicates that it exceeds the 'post_max_size' setting in the PHP configuration on the server. Increase the 'post_max_size' setting or reduce the size of the request.\n\nRequest size according to 'Content-Length' was '30', 'post_max_size' is set to '10M'.
[08:32:59] obviously the recommendation to increase post_max_size is wrong
[08:32:59] akosiaris: I see that all the time
[08:33:15] really ? I looked before the incident and there isn't any for like hours
[08:33:18] if the content-length is '30' that's way less than the 10m limit
[08:33:47] akosiaris: well I see those errors regularly when I look in the logs, I never analyzed how frequent they were or anything
[08:34:15] I'm not sure what would cause that either
[08:34:24] I'm gonna try restarting apache
[08:34:29] malformed requests ?
[08:34:29] just because I'm stumped
[08:34:32] grep post_max_size phabricator_error.log.1 | wc -l
[08:34:32] 60
[08:34:32] root@iridium:/var/log/apache2# grep post_max_size phabricator_error.log | wc -l
[08:34:32] 543
[08:34:40] akosiaris: yes malformed requests is what I had assumed it was
[08:34:41] Request: GET http://phabricator.wikimedia.org/maniphest/task/create/, from 10.64.0.106 via cp1069 cp1069 ([10.64.0.106]:80), Varnish XID 1377408704 Error: 503, Service Unavailable at Sat, 22 Aug 2015 08:22:51 GMT
[08:34:42] PROBLEM - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd)
[08:34:49] Nemo_bis: yeah, known
[08:34:55] Is what the user sees, if useful
[08:35:00] Yeah I saw
[08:35:02] twentyafterfour: yeah, I concur. restart apache
[08:35:23] twentyafterfour: did you stop phd or it just died
[08:35:26] ?
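
A side note on the post_max_size warnings being grepped above: the raw counts (60 and 543) do not show whether the warnings cluster around the outage. Bucketing them by hour would; this sketch assumes the standard Apache 2.4 error-log timestamp format visible in the quoted entries:

```
# Timestamps look like [Sat Aug 22 08:35:37.858581 2015]; splitting on the
# brackets and trimming seconds/minutes gives one bucket per hour.
grep post_max_size /var/log/apache2/phabricator_error.log \
  | awk -F'[][]' '{print $2}' \
  | cut -d. -f1 | cut -d: -f1 \
  | uniq -c
```
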
[08:35:43] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 21162 bytes in 0.226 second response time
[08:35:44] stopped it
[08:35:59] !log restarted apache2 on iridium
[08:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:36:16] !log restarted phd on iridium
[08:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:36:43] RECOVERY - PHD should be supervising processes on iridium is OK: PROCS OK: 8 processes with UID = 997 (phd)
[08:36:44] apache was using the max 150 processes, I think we need to increase the limit in mpm_worker.conf
[08:37:16] now it's using 70
[08:37:36] twentyafterfour: [Sat Aug 22 08:35:37.858581 2015] [core:notice] [pid 46392] AH00052: child pid 5083 exit signal Segmentation fault (11)
[08:37:46] hmmm
[08:37:58] * paravoid lurks
[08:38:02] ok, that would explain the problem. now what's causing the segfault
[08:38:07] lemme know if you need any help
[08:38:22] was that an apache worker segfaulting?
[08:39:01] * twentyafterfour hasn't seen an apache segfault in a few years. those used to be interesting to debug
[08:39:16] should be. logged in /var/log/apache2/error.log so I suppose master logged the child segfaulting ?
[08:39:34] also
[08:39:37] [34555253.446779] do_IRQ: 6.112 No irq handler for vector (irq -1)
[08:39:39] in dmesg
[08:39:43] this is weird
[08:39:45] hmm
[08:39:46] first time I see this
[08:40:01] that's fine, ignore it
[08:40:28] the apache segfault is most likely actually a php segfault, right?
[08:40:29] yeah, I just looked at the timestamps
[08:40:38] the do_IRQ thing is unrelated
[08:40:53] twentyafterfour: that would be my guess right now
[08:41:32] the output of iostat looks a little weird to me
[08:43:22] md0 and md2 should be seeing more reads, I would think... and what is md1, seems to be unused but it's allocated
[08:44:48] syslog has messages from odd dates interspersed with current messages
[08:45:55] Aug 2 05:05:01 iridium CRON[31032]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
[08:45:57] Aug 22 08:41:00 iridium kernel: [34563405.732872] ERST: NVRAM ERST Log Address Range not implemented yet.
[08:50:01] this is weird.. the restart obviously fixed the issue whatever it was
[08:50:11] and we never got a corefile to gdb it
[08:58:08] Yeah agreed it's weird and a little uncomfortable
[09:08:41] I wonder why there was no core dump. Is core dumping disabled? /me forgets what controls that
[09:09:51] hmm, needs to be configured with CoreDumpDirectory directive
[09:15:54] There is also a strange log entry, right before all the segfaults
[09:16:12] [Sat Aug 22 06:28:27.664577 2015] [mpm_prefork:notice] [pid 46392] AH00163: Apache/2.4.7 (Ubuntu) PHP/5.5.9-1ubuntu4.11 configured -- resuming normal operations
[09:16:14] [Sat Aug 22 06:28:27.664653 2015] [core:notice] [pid 46392] AH00094: Command line: '/usr/sbin/apache2'
[09:16:16] [Sat Aug 22 08:35:37.791160 2015] [core:notice] [pid 46392] AH00052: child pid 32242 exit signal Segmentation fault (11)
[09:16:25] I didn't restart it at 6:28
[09:17:57] well, it seems to be stable now and I can't see any clue about what went wrong. I'm going back to bed until it pages me again ;)
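
On the missing corefile: as noted at 09:09:51, Apache only dumps core if CoreDumpDirectory points somewhere writable and the process and kernel limits allow it. A sketch of the full set of knobs, with a placeholder dump directory; none of this reflects what was actually configured on iridium:

```
# Writable scratch directory for the dumps (placeholder path):
install -d -o www-data /var/tmp/apache2-coredumps

# In apache2.conf (the real httpd directive mentioned at 09:09:51):
#   CoreDumpDirectory /var/tmp/apache2-coredumps

# In /etc/apache2/envvars, so the children inherit a nonzero core limit:
#   ulimit -c unlimited

# Let the kernel write cores there with identifiable names:
sysctl -w kernel.core_pattern=/var/tmp/apache2-coredumps/core.%e.%p

service apache2 restart
```
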
[09:18:29] !log phabricator seems stable now, restarting apache2 on iridium did the trick, unfortunately we didn't learn why
[09:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:33:54] mysql says 1000 "aborted clients" -> that means closed by the app, but that is expected for a restart
[09:34:20] no alarm sent by the db as in previous instances
[09:35:51] unrelated, there is lag on some secondary mysqls due to snapshots ongoing
[09:41:05] twentyafterfour: logrotate is the 6:28 UTC restart. it's expected
[09:41:22] but seems to me like apache on iridium needs some love
[09:43:00] like if this happens again, make sure we get core dumps
[10:11:44] (03PS1) 10Mjbmr: Re-enable Flow for fawikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233100 (https://phabricator.wikimedia.org/T109816)
[10:13:14] akosiaris: we need to raise the 150 process limit and configure core dumps in the apache conf. I'll make a task
[10:27:20] 6operations, 6Phabricator: apache on iridium "needs some love" - https://phabricator.wikimedia.org/T109941#1563287 (10mmodell) 3NEW a:3mmodell
[10:42:30] I do not think apache needs some love as much as phabricator needs it
[10:43:34] we are feeding him 1500 MySQL connections and now >150 processes, soon it will grow too large :-)
[10:43:58] like a monster that we cannot control
[10:45:11] from my side, it would be nice if we could do a read/write split - the mysql slave and the proxy are ready for that
[11:11:30] can I get an ok for a global rename for a user with 25k+ edits?
[11:14:12] jynus: ^^
[11:14:51] mafk, how many?
[11:15:10] jynus: hi, 25,745 edits and 183 accounts to be exact
[11:15:50] mafk, one sec, let me check something
[11:16:08] jynus: sure :) Thanks
[11:16:55] mafk: it's always who you know to get responses and things done :)
[11:17:16] :)
[11:17:25] jynus: I didn't recognize you, by the way :D
[11:17:40] do I know you?
[11:17:53] well, we're both from eswiki
[11:17:56] oh, marcoaurelio
[11:18:05] dferg in the past :)
[11:18:13] didn't connect the IRC nick
[11:18:46] It's like mar.k logged in with a misspelled nick ;)
[11:19:19] I am asking you to wait one sec, because we had some nasty lag recently due to some bots
[11:19:30] I want to check that it is fully gone
[11:19:52] I'm not in a hurry
[11:20:58] so, the processes are gone, but the lag may be here for around 38 minutes, that would be my only issue
[11:21:17] especially if you are not in a hurry
[11:21:41] is that ok?
[11:22:04] sure, I can wait
[11:22:06] mafk,
[11:23:27] bug is https://phabricator.wikimedia.org/T109943 if someone is curious
[11:33:27] gone for lunch
[11:34:36] enjoy your lunch :)
[12:10:00] 6operations, 6Phabricator: apache on iridium "needs some love" - https://phabricator.wikimedia.org/T109941#1563396 (10chasemp) It seems the adding of our mass of repos has really changed the load. We could look at some of the preamble client throttling as well as bots have increased.
[12:45:33] PROBLEM - puppet last run on ms-be1018 is CRITICAL puppet fail
[12:58:08] ori or anyone who knows about HHVM: T109929 could use a look, it seems like it might be some sort of HHVM code-cache corruption or something like that, if that even makes sense. I don't know that it won't fix itself before Monday.
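
The pre-rename check jynus walks through around 11:20 comes down to watching replication lag on the affected secondaries until it drains. In MySQL terms it is roughly this, with a placeholder host since the log does not name the secondaries:

```
# Seconds_Behind_Master should come back to 0 before the heavy rename runs.
mysql -h db-secondary.example -e 'SHOW SLAVE STATUS\G' | grep Seconds_Behind_Master
```
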
[13:12:33] RECOVERY - puppet last run on ms-be1018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:39:52] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL - Socket timeout after 10 seconds
[14:40:22] wtf again?
[14:42:37] segmentation fault again
[14:43:14] (Phab gives me 503s. I guess that's known.)
[14:43:18] mm
[14:43:42] zend_mm_heap corrupted
[14:43:42] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 21162 bytes in 0.147 second response time
[14:43:44] [Sat Aug 22 13:15:42.134495 2015] [core:notice] [pid 7700] AH00052: child pid 42285 exit signal Segmentation fault (11)
[14:43:46] [Sat Aug 22 14:33:42.943854 2015] [mpm_prefork:error] [pid 7700] AH00161: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting
[14:44:21] when was it updated last time?
[14:44:35] !log restarted apache2 on iridium. Segfault again. This time I at least got one clue in the log: "zend_mm_heap corrupted"
[14:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:44:44] I remember quite some weeks ago, didn't it?
[14:45:01] was what updated?
[14:45:05] phab
[14:46:39] what I am trying to say is that there is no connection
[14:47:44] between a phab update and this outage? no connection, no... phab update hasn't happened recently
[14:49:25] 6operations, 6Phabricator: apache on iridium "needs some love" - https://phabricator.wikimedia.org/T109941#1563567 (10mmodell) So it happened again, and this is what I found in the log: ``` zend_mm_heap corrupted [Sat Aug 22 13:15:42.134495 2015] [core:notice] [pid 7700] AH00052: child pid 42285 exit signal S...
[14:54:26] 6operations, 6Phabricator: apache on iridium "needs some love" (triggers Phabricator 503s) - https://phabricator.wikimedia.org/T109941#1563586 (10Aklapper)
[15:11:22] 6operations, 6Phabricator: apache on iridium "needs some love" (triggers Phabricator 503s) - https://phabricator.wikimedia.org/T109941#1563592 (10mmodell) There are some suggestions [[ http://stackoverflow.com/questions/2247977/what-does-zend-mm-heap-corrupted-mean | on stack overflow ]] Among them: > After...
[15:14:07] 6operations, 6Phabricator: apache on iridium segfaults (so far this has triggered two phabricator outages in 6 hours) - https://phabricator.wikimedia.org/T109941#1563593 (10mmodell)
[15:26:56] 6operations, 6Phabricator: apache on iridium segfaults (so far this has triggered two phabricator outages in 6 hours) - https://phabricator.wikimedia.org/T109941#1563599 (10greg) p:5Normal>3High
[15:59:04] (03CR) 10Zfilipin: "Should RuboCop ignore everything in modules folder?" [puppet] - 10https://gerrit.wikimedia.org/r/226898 (owner: 10Faidon Liambotis)
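
The AH00161 entry at 14:43:46 is the 150-worker ceiling hit during this second outage. Under mpm_prefork, raising it means editing mpm_prefork.conf, and ServerLimit has to rise along with MaxRequestWorkers once you go past 256. The values below are illustrative only, not what was later applied to iridium:

```
# /etc/apache2/mods-available/mpm_prefork.conf (illustrative values):
<IfModule mpm_prefork_module>
    StartServers            10
    MinSpareServers         10
    MaxSpareServers         20
    ServerLimit             300
    MaxRequestWorkers       300
    MaxConnectionsPerChild  1000
</IfModule>
```

A nonzero MaxConnectionsPerChild also recycles children periodically, which can limit the blast radius of the kind of heap corruption seen here.
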
[16:23:54] PROBLEM - puppet last run on erbium is CRITICAL Puppet has 1 failures
[16:32:59] !log raising values in mpm_worker.conf for iridium to debug and hopefully head off further crashing
[16:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:45:30] !log scratch that as we have mpm_prefork enabled :)
[16:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:49:23] RECOVERY - puppet last run on erbium is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures
[16:54:27] 6operations, 10Adminbot: Upload new release of adminbot for Trusty - https://phabricator.wikimedia.org/T109947#1563644 (10scfc) 3NEW
[16:59:53] <_joe_> chasemp: we can't use worker there as we're still on mod_php
[17:00:09] :) I'm with it now
[17:00:24] <_joe_> yeah sorry, just showed up now :)
[17:25:41] 6operations, 6Phabricator: apache on iridium segfaults (so far this has triggered two phabricator outages in 6 hours) - https://phabricator.wikimedia.org/T109941#1563664 (10chasemp) > >> `export USE_ZEND_ALLOC=0` > I am not sure about this setting. I have seen it used for debugging but I don't know what t...
[17:28:15] !log tweaking apache on iridium T109941
[17:28:18] 6operations, 6Phabricator: apache on iridium segfaults (so far this has triggered two phabricator outages in 6 hours) - https://phabricator.wikimedia.org/T109941#1563672 (10chasemp) I'm making notes to make this persistent on monday.
[17:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:44:15] 6operations, 5Patch-For-Review: Install fonts-wqy-zenhei on all mediawiki app servers - https://phabricator.wikimedia.org/T84777#1563690 (10Krenair)
[17:53:18] (03CR) 10Tim Landscheidt: "After submitting the patch, I thought about whether to subscribe sge_execd to changes of host_aliases would make sense to have those chang" [puppet] - 10https://gerrit.wikimedia.org/r/233087 (https://phabricator.wikimedia.org/T109728) (owner: 10Tim Landscheidt)
[18:08:54] 6operations, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-Requests, 5Patch-For-Review, and 2 others: Configure mediawiki to operate in the Dallas DC - https://phabricator.wikimedia.org/T91754#1563705 (10Krenair) What's missing here still, @Joe?
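
On the `export USE_ZEND_ALLOC=0` suggestion quoted from T109941 at 17:25:41: setting it makes PHP bypass the Zend memory manager in favour of plain system malloc, which is slower but lets glibc or valgrind catch the corruption instead of it surfacing later as "zend_mm_heap corrupted". With mod_php the variable has to reach Apache's environment; a debugging sketch, not something the log says was applied:

```
# In /etc/apache2/envvars, which apache2ctl sources on Ubuntu,
# so the mod_php children inherit it (debugging aid only, noticeably slower):
export USE_ZEND_ALLOC=0
# then: service apache2 restart
```
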
[18:09:16] 6operations, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-Requests, 5codfw-appserver-setup, 5wikis-in-codfw: Configure mediawiki to operate in the Dallas DC - https://phabricator.wikimedia.org/T91754#1563707 (10Krenair)
[18:13:28] (03PS1) 10Southparkfan: Fix minor spelling mistake [puppet] - 10https://gerrit.wikimedia.org/r/233118
[18:23:44] (03CR) 10Merlijn van Deen: [C: 031] gridengine: Ensure that service gridengine-exec is running [puppet] - 10https://gerrit.wikimedia.org/r/233087 (https://phabricator.wikimedia.org/T109728) (owner: 10Tim Landscheidt)
[19:31:36] I'm taking care of the labstore alerts btw
[19:37:52] PROBLEM - puppet last run on mw1186 is CRITICAL Puppet has 1 failures
[19:41:38] !log manually remove old snapshots from labstore1002
[19:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:03:33] RECOVERY - puppet last run on mw1186 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures
[20:13:18] 7Blocked-on-Operations, 6Labs: labstore1002 out of space in vg to create new snapshots - https://phabricator.wikimedia.org/T109954#1563836 (10yuvipanda)
[20:13:37] 6operations, 6Labs: labstore1002 out of space in vg to create new snapshots - https://phabricator.wikimedia.org/T109954#1563855 (10yuvipanda)
[20:28:37] 6operations, 6Labs: labstore1002 out of space in vg to create new snapshots - https://phabricator.wikimedia.org/T109954#1563887 (10yuvipanda) So this also means backups have been broken for about a week.
[20:29:06] 6operations, 6Labs: labstore1002 out of space in vg to create new snapshots - https://phabricator.wikimedia.org/T109954#1563888 (10yuvipanda) p:5Triage>3High
[21:08:51] (03CR) 10Yuvipanda: "People use this in labs, should've been in a production realm branch..." [puppet] - 10https://gerrit.wikimedia.org/r/231487 (owner: 10Ori.livneh)
[21:09:23] (03CR) 10Yuvipanda: "(or in your bash profile :P)" [puppet] - 10https://gerrit.wikimedia.org/r/231487 (owner: 10Ori.livneh)
[21:09:36] ori: I made some comments on https://gerrit.wikimedia.org/r/#/c/231487/
[21:11:50] (03PS2) 10Yuvipanda: Tools: Remove obsolete entries from host_aliases [puppet] - 10https://gerrit.wikimedia.org/r/232968 (https://phabricator.wikimedia.org/T109871) (owner: 10Tim Landscheidt)
[21:12:00] (03CR) 10Yuvipanda: [C: 032 V: 032] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/232968 (https://phabricator.wikimedia.org/T109871) (owner: 10Tim Landscheidt)
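
The labstore1002 cleanup logged at 19:41:38 is plain LVM housekeeping: per T109954, the volume group has no free space left to allocate new snapshots, so old ones have to go. The general shape of it, with placeholder names since the log does not give the vg/lv layout:

```
vgs                                    # free space remaining in the volume group
lvs -o lv_name,lv_attr,lv_size,origin  # snapshot LVs carry an 's' attribute and an origin
lvremove /dev/<vg>/<old-snapshot>      # placeholders; removing one returns its space to the vg
```
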
[21:12:30] (03PS2) 10Yuvipanda: gridengine: Ensure that service gridengine-exec is running [puppet] - 10https://gerrit.wikimedia.org/r/233087 (https://phabricator.wikimedia.org/T109728) (owner: 10Tim Landscheidt)
[21:12:38] (03CR) 10Yuvipanda: [C: 032 V: 032] gridengine: Ensure that service gridengine-exec is running [puppet] - 10https://gerrit.wikimedia.org/r/233087 (https://phabricator.wikimedia.org/T109728) (owner: 10Tim Landscheidt)
[21:13:26] (03PS2) 10Yuvipanda: Tools: Only execute cdnjs-packages-gen on changes [puppet] - 10https://gerrit.wikimedia.org/r/232949 (owner: 10Tim Landscheidt)
[21:13:34] (03CR) 10Yuvipanda: [C: 032 V: 032] Tools: Only execute cdnjs-packages-gen on changes [puppet] - 10https://gerrit.wikimedia.org/r/232949 (owner: 10Tim Landscheidt)
[21:13:39] (03PS1) 10Alex Monk: Revert "base: ensure => absent on 'command-not-found'" [puppet] - 10https://gerrit.wikimedia.org/r/233156
[22:26:01] 7Puppet, 6Labs, 5Patch-For-Review: Could not find data item labs_recursor - https://phabricator.wikimedia.org/T107205#1564451 (10scfc) 5Open>3Resolved a:3scfc The linked patch should have fixed this bug for new self-hosted puppetmasters; on your existing instance (worst case) this should require `sudo...
[22:57:35] 7Puppet, 6Labs, 3Labs-Sprint-104, 3Labs-Sprint-105: Allow per-host hiera overrides via wikitech - https://phabricator.wikimedia.org/T104202#1564500 (10scfc) a:3scfc
[23:08:29] !log krenair@tin Synchronized php-1.26wmf19/extensions/AbuseFilter/maintenance/addMissingLoggingEntries.php: (no message) (duration: 01m 05s)
[23:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:23:53] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0]
[23:35:52] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
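
The "Tools: Only execute cdnjs-packages-gen on changes" change merged at 21:13:34 is the usual Puppet pattern for that: mark the exec refreshonly so it runs only when a resource it subscribes to changes, rather than on every agent run. A generic sketch; the command path and the subscribed resource are hypothetical, not taken from the actual patch:

```
# Hypothetical resource names for illustration:
exec { 'cdnjs-packages-gen':
    command     => '/usr/local/bin/cdnjs-packages-gen',
    refreshonly => true,
    subscribe   => File['/srv/cdnjs/packages.list'],
}
```
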