[00:08:05] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:09:55] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[00:58:30] oh, he's gone, I was going to +100 to asciinema to MaxSem
[00:59:05] greg-g: too late, we'll all be using proprietary software instead to look at things typed via ssh into our production cluster!
[00:59:07] mwahahaha!
[01:01:46] asciinema is FLOSS?
[01:02:16] * greg-g goes to get pizza and hang out with family
[01:02:30] YuviPanda: welcome home, glad you're back safe
[01:04:27] greg-g: :D
[01:04:31] it isn't?
[01:04:38] considering you two talking about it I thought it was
[01:35:33] (03PS4) 10Ori.livneh: Add Graphite module & role [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271
[01:36:15] (03CR) 10Ori.livneh: "This is now ready for review." [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 (owner: 10Ori.livneh)
[01:37:06] (03PS5) 10Ori.livneh: Add Graphite module & role [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271
[02:11:33] !log LocalisationUpdate completed (1.23wmf2) at Sun Nov 10 02:11:33 UTC 2013
[02:11:56] Logged the message, Master
[02:20:46] !log LocalisationUpdate completed (1.23wmf3) at Sun Nov 10 02:20:46 UTC 2013
[02:21:04] Logged the message, Master
[02:52:22] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Nov 10 02:52:22 UTC 2013
[02:52:37] Logged the message, Master
[04:02:12] PROBLEM - Puppet freshness on sq80 is CRITICAL: No successful Puppet run in the last 10 hours
[04:54:42] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:54:43] (03CR) 10Ori.livneh: [C: 031] Disable slow UW queries [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94404 (owner: 10MaxSem)
[04:55:32] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[05:24:39] (03PS1) 10Ori.livneh: MW_RSYNC_ARGS: '--delay-updates' & '--compress' [operations/puppet] - 10https://gerrit.wikimedia.org/r/94591
[06:02:21] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:03:11] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2731173 seconds since restart
[06:15:29] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100%
[06:16:29] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 35.39 ms
[06:31:50] (03CR) 10TTO: [C: 031] "No objections after a month at https://sk.wikisource.org/wiki/Wikizdroje:Port%C3%A1l_komunity#Znemo.C5.BEn.C4.9Bn.C3.AD_lok.C3.A1ln.C3.ADh" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86643 (owner: 10Danny B.)
[07:06:10] (03PS3) 10Nemo bis: skwikisource: Restrict uploads to sysops [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86643 (owner: 10Danny B.)
[07:06:16] (03CR) 10jenkins-bot: [V: 04-1] skwikisource: Restrict uploads to sysops [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86643 (owner: 10Danny B.)
[07:06:27] (03CR) 10Nemo bis: "Thanks, added to summary. Needs rebase." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86643 (owner: 10Danny B.)
[07:35:21] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
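(Aside: the "MySQL Slave Running" and "MySQL Recent Restart" alerts above are Icinga NRPE checks against db1047; the OK line simply echoes the Slave_IO_Running, Slave_SQL_Running and Last_Error fields of SHOW SLAVE STATUS. A minimal PHP sketch of what such a check verifies — hypothetical host and credentials, not the actual NRPE plugin:)

    <?php
    // Illustrative sketch only: connect to the replica and verify the same
    // SHOW SLAVE STATUS fields the Icinga check reports. Host and credentials
    // are made up; Nagios exit codes: 0 = OK, 2 = CRITICAL.
    $db = new mysqli( 'db1047', 'icinga', 'secret' );
    if ( $db->connect_error ) {
        echo "CRITICAL: cannot connect: {$db->connect_error}\n";
        exit( 2 );
    }
    $res = $db->query( 'SHOW SLAVE STATUS' );
    $row = $res ? $res->fetch_assoc() : null;
    if ( !$row ) {
        echo "CRITICAL: replication is not configured\n";
        exit( 2 );
    }
    if ( $row['Slave_IO_Running'] === 'Yes' && $row['Slave_SQL_Running'] === 'Yes' ) {
        echo "OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: {$row['Last_Error']}\n";
        exit( 0 );
    }
    echo "CRITICAL: Slave_IO_Running: {$row['Slave_IO_Running']} Slave_SQL_Running: {$row['Slave_SQL_Running']} Last_Error: {$row['Last_Error']}\n";
    exit( 2 );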
[07:36:11] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2736755 seconds since restart
[07:45:57] (03Abandoned) 10Arav93: Renamed $wmfConfigDir to $wmgConfigDir in mediawiki-config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94354 (owner: 10Arav93)
[08:17:31] (03PS1) 10Arav93: Renamed $wmf* to $wmg* [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94598
[08:36:12] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:37:02] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[08:58:22] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:59:12] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2741737 seconds since restart
[09:43:46] PROBLEM - Puppet freshness on sq48 is CRITICAL: No successful Puppet run in the last 10 hours
[09:48:26] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:50:16] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2744797 seconds since restart
[09:53:06] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:53:26] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:56:56] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[10:00:06] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:00:26] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2745397 seconds since restart
[10:00:56] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[10:11:21] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:12:11] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:13:01] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[10:14:11] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2746237 seconds since restart
[10:57:21] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:58:11] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2748877 seconds since restart
[11:09:58] (03PS2) 10Ori.livneh: Bumped jobqueue warning threshhold [operations/puppet] - 10https://gerrit.wikimedia.org/r/93585 (owner: 10Aaron Schulz)
[11:10:15] (03CR) 10Ori.livneh: [C: 032] Bumped jobqueue warning threshhold [operations/puppet] - 10https://gerrit.wikimedia.org/r/93585 (owner: 10Aaron Schulz)
[11:11:08] :(
[12:01:22] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:02:22] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2752717 seconds since restart
[13:35:50] (03CR) 10Nemo bis: "To your question on #wikimedia (please use #wikimedia-dev next time): you can clone it with git clone ssh://@gerrit.wikimedia.or" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94598 (owner: 10Arav93)
[13:43:17] (03CR) 10Arav93: "I did that and changed the $wmf* to $wmg* , now should I upload that patch separately?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94598 (owner: 10Arav93)
[13:43:27] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:44:17] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2758840 seconds since restart
[13:48:21] (03CR) 10Nemo bis: "Yes, it's a separate repo so it will be a separate patch." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94598 (owner: 10Arav93)
[13:49:07] (03PS1) 10Odder: (bug 56859) Update $wgCategoryCollation on iswiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94607
[13:50:36] (03CR) 10Arav93: "Ok, thanks. :)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94598 (owner: 10Arav93)
[13:51:17] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:51:27] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:54:07] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[13:54:17] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2759440 seconds since restart
[13:59:17] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:59:28] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:59:57] (03PS5) 10Nemo bis: Simplify misc::maintenance::update_special_pages a bit [operations/puppet] - 10https://gerrit.wikimedia.org/r/90117
[14:00:17] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[14:00:17] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2759800 seconds since restart
[14:02:37] PROBLEM - Puppet freshness on sq80 is CRITICAL: No successful Puppet run in the last 10 hours
[16:59:25] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:25] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 35.63 ms
[19:04:14] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:29:59] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 0.099 second response time
[19:44:39] PROBLEM - Puppet freshness on sq48 is CRITICAL: No successful Puppet run in the last 10 hours
[19:53:53] (03PS1) 10QChris: Fix indentation in manifests/misc/statistic.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/94624
[19:53:54] (03PS1) 10QChris: Restrict access to geowiki's data-private checkout [operations/puppet] - 10https://gerrit.wikimedia.org/r/94625
[19:53:55] (03PS1) 10QChris: Serve geowiki's private data through statistics websever [operations/puppet] - 10https://gerrit.wikimedia.org/r/94626
[20:25:10] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:49:00] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 0.071 second response time
[21:34:10] manybubbles|away, PHP Warning: Search backend error during full text search for '"10.2139/ssrn.1825124"'. Error message is: SearchPhaseExecutionException[Failed to execute phase [dfs], all shards failed
[21:34:20] this is rampaging in logs
[22:39:17] (03PS1) 10Ryan Lane: Add shadow_reference support to trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/94680
[22:40:39] (03CR) 10jenkins-bot: [V: 04-1] Add shadow_reference support to trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/94680 (owner: 10Ryan Lane)
[23:02:02] (03PS2) 10Ryan Lane: Add shadow_reference support to trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/94680
[23:02:23] (03CR) 10Ryan Lane: [C: 04-1] "Work in progress, needs testing." [operations/puppet] - 10https://gerrit.wikimedia.org/r/94680 (owner: 10Ryan Lane)
[23:13:01] !log Investigating possible site issue and logging everything that I come across.
[23:13:22] Logged the message, Master
[23:14:04] !log exception.log: Exception from line 61 of /usr/local/apache/common-local/php-1.23wmf2/includes/media/ImageHandler.php: No width specified to ImageHandler::makeParamString (194 such errors)
[23:14:19] Logged the message, Master
[23:14:51] !log exception.log: Exception from line 114 of /usr/local/apache/common-local/php-1.23wmf2/includes/upload/UploadStash.php: UploadStash::getFile No user is logged in, files must belong to users (8 such errors)
[23:15:08] Logged the message, Master
[23:15:35] hey ori-l
[23:15:43] hey MaxSem
[23:15:46] feel free to join in
[23:16:03] my current approach:
[23:16:04] 15:05 it's very difficult to diagnose issues when every single log you inspect is saturated with a different set of errors
[23:16:04] 15:07 i'm trying to assess if something is the matter, based on scattered reports on #wikimedia-tech, but as usual there are a dozen or so alarms going off and i have no idea which ones to take seriously
[23:16:05] 15:09 my plan is to !log everything i come across that is broken, regardless of whether or not it is causing the current site issue (if there is indeed one), because everything is implicated by virtue of obscuring the actual state of the site and by competing for attention.
[23:16:10] do you have enough access to look up what 5xx sources are there?
[23:16:24] Probably, if I knew where to look.
[23:16:58] I just had talk with Dutch guys in #-tech who complained that their pages that hit parser limits are too slow to parse:P
[23:17:19] !log exception.log: Exception from line 110 of /usr/local/apache/common-local/php-1.23wmf2/includes/WikiPage.php: Invalid or virtual namespace -1 given. (2 such errors)
[23:17:32] varnishstat?
[23:17:37] Logged the message, Master
[23:17:56] !log PC LOAD LETTER
[23:17:57] I'll take another minute to document what I'm seeing on fluorine and then check varnishstat
[23:18:06] hi Elsie
[23:18:12] Hi.
[23:18:15] (03PS1) 10Ryan Lane: Add mediawiki module for fetch and checkout hooks [operations/puppet] - 10https://gerrit.wikimedia.org/r/94682
[23:18:19] ori-l: ^^
[23:18:44] the two changes I just pushed in should generate localization cache during fetch/checkout stages on the minions
[23:18:58] of course it's untested and is a work in progress :D
[23:19:11] yay :)
[23:19:21] but I think this is a good approach
[23:19:21] (03CR) 10jenkins-bot: [V: 04-1] Add mediawiki module for fetch and checkout hooks [operations/puppet] - 10https://gerrit.wikimedia.org/r/94682 (owner: 10Ryan Lane)
[23:20:00] I added a feature that'll make a shadow reference of a repo. the fetch hook generates the cache in the shadow repo and the checkout rsyncs the files into the normal repo
[23:20:19] since the shadow repo is kept up to date it should only change the cache when needed
[23:20:49] the shadow repo uses git reference clone for the repo itself, but does full clones from the filesystem for the submodules
[23:20:57] !log poolcounter.log: Pool counter is full (multiple wikis)
[23:21:01] git submodules fucking suck
[23:21:11] Logged the message, Master
[23:21:37] Ryan_Lane: have you seen https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-1day&from=-1%20day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=staircase&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22)&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22) ?
[23:21:49] ewwwwww
[23:22:03] that's from today?
[23:22:20] and are those 5xx generated from apache or elsewhere?
[23:22:43] Ryan_Lane, apache logs are clear
[23:23:00] so this must be gateway errors or something
[23:23:04] ori-l: any idea which datacenter is generating it?
[23:23:30] both esams and ulsfo are good possibilities
[23:24:09] !log fatal log: Fatal error: Allowed memory size of 201326592 bytes exhausted (tried to allocate 72 bytes) at /usr/local/apache/common-local/php-1.23wmf2/extensions/WikibaseDataModel/DataModel/Entity/Entity.php on line 130 (47 such fatals)
[23:24:25] Logged the message, Master
[23:26:19] !log redis.log: flooded with 'Used automatic re-authentication for Lua script [...]' (68,955 such messages)
[23:26:25] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Text%2520caches%2520esams&tab=m&vn=&hide-hf=false
[23:26:31] I'd say it's the text caches in esams
[23:26:34] Logged the message, Master
[23:26:46] http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&s=by+name&c=Text+caches+esams&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4
[23:27:40] yes, that's a pretty safe bet
[23:27:57] though it doesn't match up perfectly
[23:28:25] it does go with the traffic trend for esams, though
[23:32:56] the upload cluster in esams has a similar looking profile
[23:33:35] mobile has a really odd looking usage profile: http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&s=by+name&c=Mobile+caches+esams&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4
[23:34:47] upload in eqiad looks like it's handling things better: http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&s=by+name&c=Upload+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4
[23:35:29] yeah, i was just looking at mobile caches esams
[23:36:10] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Mobile+caches+esams&m=cpu_report&s=by+name&mc=2&g=network_report
[23:37:18] packet loss
[23:37:35] from esams to eqiad
[23:37:46] on xe-4-2-2.cr1-eqiad.wikimedia.org
[23:37:54] LeslieCarr, mark: ^^
[23:37:58] channel full or...?
[23:37:59] I'm sure mark is asleep
[23:38:18] I have no clue
[23:38:33] my knowledge of the network isn't great
[23:41:03] I think we should page Leslie, no?
[23:41:17] yeah, I was about to. Was updating the email
[23:41:32] and hoping she'd respond in the channel before I resorted to a page :)
[23:42:06] I'm hoping to provide some visibility about how many things need to be fixed to the rest of the org so I'm being deliberately chatty with !logs
[23:42:21] good
[23:42:23] !log Per Ryan: packet loss from esams to eqiad on xe-4-2-2.cr1-eqiad.wikimedia.org
[23:42:25] (hi all)
[23:42:38] Logged the message, Master
[23:42:47] not that anyone reads the SAL
[23:42:49] !log CPU overload in text caches esams
[23:42:58] ori-l, you really think anyone reads the SAL?:P
[23:43:07] Logged the message, Master
[23:43:10] well, both of you are wrong
[23:43:13] because the ops team does
[23:43:25] ori-l: thank you for doing the proper action
[23:43:27] at least at the 'end of this' (is there?) he can copy paste into email being like: Look. Do something. Please.
[23:43:41] 68,955 messages about Lua auth?
[23:43:50] Are those just debug messages?
[23:44:01] Ryan_Lane: yeah, I was mostly sarcastic, knowing ~only ops does, and these aren't all ops problems
[23:44:13] this specific issue is ops problem :)
[23:44:22] Elsie: yes, that is very suspicious. The context is that Aaron identified a bug in phpredis whereby it reconnects transparently if a connection drops *but* it doesn't reauthenticate
[23:44:24] well, that one, but the other 10 ori has !log'd ;)
[23:44:28] heh
[23:44:39] Elsie: filed as https://github.com/nicolasff/phpredis/issues/403
[23:44:46] Fun. :-)
[23:45:10] ok
[23:45:11] pagers
[23:45:12] :p
[23:45:13] Aaron worked around it in MW code and added logging, but I don't think we expected it to be this common
[23:45:19] LeslieCarr: :)
[23:45:34] well, if I could fix network issues I'd do so and not need to page you
[23:45:42] should i read SAL for the overview of what's up ? or anyone want to give me a 30 second (other than i am about to check the link right now)
[23:45:51] so I'm sorry because my lack of knowledge interrupts you ;)
[23:46:01] oh, wrong channel :D
[23:46:25] LeslieCarr, packet loss on xe-4-2-2.cr1-eqiad.wikimedia.org
[23:46:27] LeslieCarr: I can give you the what's up
[23:46:38] we have a lot of 5xx errors in graphite
[23:46:45] logging into that now
[23:47:02] I tracked down high amounts of waitio and load on esams varnish
[23:47:07] so my immediate thought was packet loss
[23:47:26] I ran mtr from iron to cp3003 and no packet loss from eqiad to esams
[23:47:36] I ran it the opposite way and there was packet loss
[23:51:33] interesting
[23:52:58] setting up some mtr's now
[23:53:53] now that i'm looking at it, there's no packet loss on the mtr's
[23:54:06] which sadly is often what happens :(
[23:55:27] :(
[23:55:42] indeed. I'm not seeing it now either
[23:55:58] however, if this is happening occasionally it's probably enough to fuck with varnish
[23:56:26] LeslieCarr: compare:
[23:56:27] https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-1day&from=-1%20day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=staircase&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22)&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22)
[23:56:30] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Mobile+caches+esams&m=cpu_report&s=by+name&mc=2&g=network_report
[23:56:39] varnish sucks and is highly susceptible to packet loss
[23:57:23] :( http://gdash.wikimedia.org/dashboards/reqerror/
[23:57:39] heh. I had a 500 from gdash
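(Aside: the reqstats.5xx and reqstats.500 series behind the graphite and gdash links above can also be pulled as JSON via the render API, which makes it easier to see how often the 5xx spikes recur. A rough PHP sketch; the time window and threshold are arbitrary:)

    <?php
    // Rough sketch: fetch the last few hours of reqstats.5xx from graphite's
    // render API (format=json) and print the minutes that spike.
    $url = 'https://graphite.wikimedia.org/render/?target=reqstats.5xx&from=-3hours&format=json';
    $series = json_decode( file_get_contents( $url ), true );
    $threshold = 500; // 5xx responses per minute; arbitrary cut-off for this sketch
    foreach ( $series[0]['datapoints'] as $point ) {
        list( $value, $ts ) = $point; // graphite returns [value, timestamp] pairs
        if ( $value !== null && $value > $threshold ) {
            echo gmdate( 'H:i', $ts ) . " UTC: {$value} 5xx/min\n";
        }
    }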
[23:58:18] it looks like the packet loss is happening every 10 minutes or so for short periods of time
[23:59:38] at least, I'm basing that on the 500 graph
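(Back to the redis.log flood logged earlier: the phpredis bug Ryan described is that the client reconnects transparently after a dropped connection but does not re-send AUTH, so the next command fails until the caller re-authenticates. A minimal sketch of that re-auth-and-retry pattern — a hypothetical helper, not Aaron's actual MediaWiki change:)

    <?php
    // Sketch of the workaround pattern (hypothetical helper, not the real
    // MediaWiki code): if a command fails because the transparent reconnect
    // skipped AUTH, re-authenticate once, log it, and retry.
    function redisCallWithReauth( Redis $conn, $password, callable $op ) {
        try {
            return $op( $conn );
        } catch ( RedisException $e ) {
            if ( $conn->auth( $password ) ) {
                // Events like this are what flooded redis.log above.
                wfDebugLog( 'redis', 'Used automatic re-authentication for Redis command.' );
                return $op( $conn );
            }
            throw $e;
        }
    }
    // Usage (key name is made up):
    // $ok = redisCallWithReauth( $conn, $secret, function ( Redis $c ) {
    //     return $c->exists( 'jobqueue:ready' );
    // } );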