[02:12:19] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3790 MB (3% inode=99%):
[02:13:53] !log LocalisationUpdate completed (1.24wmf3) at 2014-05-11 02:12:49+00:00
[02:14:04] Logged the message, Master
[02:21:19] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3433 MB (3% inode=99%):
[02:23:06] !log LocalisationUpdate completed (1.24wmf4) at 2014-05-11 02:22:02+00:00
[02:23:12] Logged the message, Master
[02:41:09] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[03:00:19] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:10:09] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 14.29% of data exceeded the critical threshold [500.0]
[03:11:43] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun May 11 03:10:37 UTC 2014 (duration 10m 36s)
[03:11:50] Logged the message, Master
[03:18:19] Got a page...seems not meaningful, should tweak anomaly stuff to be only if paging isn't actionable right now
[03:19:05] Email only I meant
[03:24:09] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]
[04:02:09] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[07:54:00] (CR) Hoo man: [C: -1] "Looks good in general" (5 comments) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130274 (https://bugzilla.wikimedia.org/64255) (owner: Gerrit Patch Uploader)
[08:13:33] csteipp: the better gpg keyfob: https://www.assembla.com/spaces/cryptostick/wiki and the other one i spoke about: www.ftsafe.com/product/epass/epass2003
[08:31:51] ori, sumanah would love to have you in room 2 to talk about performance guidelines if you're around
[09:26:25] (PS4) Hoo man: Run rebuildEntityPerPage.php on Wikidata (once per week) [operations/puppet] - https://gerrit.wikimedia.org/r/120535
[09:29:50] (PS5) Hoo man: Run rebuildEntityPerPage.php on Wikidata (once per week) [operations/puppet] - https://gerrit.wikimedia.org/r/120535
[11:16:08] (PS1) TheDJ: Remove old WikiEditor settings [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132793
[11:17:04] (PS2) Ori.livneh: Move rcstream server implementation to external repo [operations/puppet] - https://gerrit.wikimedia.org/r/132429
[11:18:45] (PS2) Ori.livneh: Move diamond::generic to manifests/ and lint [operations/puppet] - https://gerrit.wikimedia.org/r/132218
[11:19:17] (CR) Ori.livneh: "Ok. I'll amend it to make the changes to the parameter types to the file in its current location." [operations/puppet] - https://gerrit.wikimedia.org/r/132218 (owner: Ori.livneh)
[11:19:31] (PS4) Ori.livneh: Tidy ::applicationserver & ::applicationserver::pybal_check [operations/puppet] - https://gerrit.wikimedia.org/r/132217
[11:19:39] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out
[11:20:06] (PS1) Hoo man: Remove long absent Wikidata crons from puppet [operations/puppet] - https://gerrit.wikimedia.org/r/132794
[11:20:08] (PS1) Hoo man: Run Wikidata maint. scripts as apache instead of mwdeploy [operations/puppet] - https://gerrit.wikimedia.org/r/132795
[11:20:09] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out
[11:20:09] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out
[11:20:09] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out
[11:20:09] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out
[11:20:09] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out
[11:20:09] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out
[11:20:25] ottomata: ^
[11:21:59] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 443 bytes in 0.053 second response time
[11:21:59] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 443 bytes in 0.078 second response time
[11:21:59] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 443 bytes in 0.071 second response time
[11:21:59] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 443 bytes in 0.080 second response time
[11:21:59] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 68115 bytes in 0.269 second response time
[11:22:09] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 21.43% of data exceeded the critical threshold [500.0]
[11:22:29] (CR) Ottomata: [C: 2 V: 2] Remove long absent Wikidata crons from puppet [operations/puppet] - https://gerrit.wikimedia.org/r/132794 (owner: Hoo man)
[11:22:29] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 443 bytes in 0.059 second response time
[11:22:58] So what in the heck was that about
[11:22:59] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 443 bytes in 0.066 second response time
[11:24:31] https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Swift%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1399807327&g=network_report&z=large
[11:24:44] (CR) Ottomata: [C: 2 V: 2] Run Wikidata maint. scripts as apache instead of mwdeploy [operations/puppet] - https://gerrit.wikimedia.org/r/132795 (owner: Hoo man)
[11:32:48] Wow, thanks. Serious spike
[11:33:09] swift's network got saturated
[11:33:26] there seems to be a more or less equivalent network spike on imagescalers
[11:34:22] so this suggests the imagescaler cluster requested one or more large images from swift at about the same time
[11:34:37] with multiple requests, since all of the imagescalers show this spike and all of the swift proxies as well
[11:35:19] PoolCounter :)
[11:36:15] nypl again
[11:36:24] GET /v1/AUTH_mw/wikipedia-commons-local-public.29/2/29/Bronx,_V._12,_Double_Page_Plate_No._273_%28Map_bounded_by_Whiting_Ave.,_Ewen_Ave.,_Warren_Ave.,_Hudson_River%29_NYPL2001533.tiff:
[11:36:28] ffs
[11:37:00] Multiple requests for the same file to scale?
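[Editor's note on the "PoolCounter :)" remark at 11:35:19: the usual fix for this pattern is to cap how many workers may render the same original at once, so a burst of thumbnail requests for one large TIFF does not have every imagescaler pull the full-size file from Swift simultaneously. Below is a minimal sketch of that idea, assuming an in-process semaphore and a hypothetical render_thumbnail() helper; the real PoolCounter is a separate network service shared by all scalers, not the per-process state shown here.]

```python
import threading
from collections import defaultdict

def render_thumbnail(original, width):
    # Placeholder for the real scaler invocation; only here so the sketch
    # is self-contained and runnable.
    return f"{original} @ {width}px"

class PoolCounterLike:
    """Cap concurrent renders per original file, PoolCounter-style."""

    def __init__(self, limit=2):
        self._guard = threading.Lock()
        # One bounded semaphore per key (original file name). `limit` is a
        # made-up value, not whatever production uses.
        self._sems = defaultdict(lambda: threading.BoundedSemaphore(limit))

    def run(self, key, work):
        with self._guard:
            sem = self._sems[key]
        with sem:  # blocks while `limit` renders of this key are in flight
            return work()

pool = PoolCounterLike(limit=2)

def scale(original, width):
    # Concurrent requests for the same original queue up here instead of
    # each independently fetching the multi-hundred-MB source from Swift.
    return pool.run(original, lambda: render_thumbnail(original, width))

print(scale("NYPL2001533.tiff", 220))
```

[In production the counter has to be shared across all imagescaler hosts, which is why PoolCounter is a standalone service rather than in-process state like this.]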
[11:37:09] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]
[11:38:28] 11/May/2014/11/36/06 PUT /v1/AUTH_mw/wikipedia-commons-local-public.bc/b/bc/Manhattan%252C_V._11%252C_Plate_No._2_%2528Map_bounded_by_12th_Ave.%252C_W._133rd_St.%252C_Broadway%252C_W._130th_St.%2529_NYPL1995957.tiff
[11:38:36] that looks like an upload doesn't it
[11:38:55] must be with the put
[11:39:08] https://commons.wikimedia.org/wiki/Special:Contributions/F%C3%A6
[11:41:00] the good news is, it has been going on for hours
[11:41:43] oh wait, these were metadata updates
[11:42:22] https://commons.wikimedia.org/wiki/Special:ListFiles/F%C3%A6
[11:43:16] 50 in the last couple of ideas
[11:43:20] ideas? hours
[11:52:19] # grep -hr filename * | grep NYPL | sort -u |wc -l
[11:52:20] 7396
[11:52:22] yeah okay
[12:05:53] chasemp: placed a workaround and sent an update to ops@
[12:06:02] going back to the hackathon stuff now
[12:07:13] Awesome thank you for keeping me in the loop, even though I'm not too useful on it
[12:33:58] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Sun May 11 12:29:07 2014
[12:35:58] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Sun May 11 12:29:07 2014
[12:37:58] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Sun May 11 12:29:07 2014
[12:39:58] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Sun May 11 12:29:07 2014
[12:40:34] we get it
[12:41:51] :-)
[12:41:58] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Sun May 11 12:29:07 2014
[12:42:09] * andrewbogott is looking but only has five mins
[12:43:08] RECOVERY - Puppet freshness on hooft is OK: puppet ran at Sun May 11 12:43:02 UTC 2014
[12:43:46] ooh, it looks like it's my fault too :(
[12:43:51] hah!
[12:44:58] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Sun May 11 12:43:02 2014
[12:46:58] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Sun May 11 12:43:02 2014
[12:47:59] The fact that it's telling us every five minutes though… I don't think that part is my fault :)
[12:48:48] RECOVERY - Puppet freshness on hooft is OK: puppet ran at Sun May 11 12:48:41 UTC 2014
[12:50:58] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Sun May 11 12:48:41 2014
[12:52:58] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Sun May 11 12:48:41 2014
[12:54:58] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Sun May 11 12:48:41 2014
[12:55:08] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 21.43% of data exceeded the critical threshold [500.0]
[12:56:58] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Sun May 11 12:48:41 2014
[12:58:58] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Sun May 11 12:48:41 2014
[13:00:08] RECOVERY - Puppet freshness on hooft is OK: puppet ran at Sun May 11 13:00:03 UTC 2014
[13:01:58] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Sun May 11 13:00:03 2014
[13:08:04] RECOVERY - Puppet freshness on hooft is OK: puppet ran at Sun May 11 13:07:59 UTC 2014
[13:11:08] (PS1) Yuvipanda: toollabs: Add gdal-bin to the exec environment [operations/puppet] - https://gerrit.wikimedia.org/r/132803
[13:11:12] andrewbogott: Coren ^ minor patch?
[13:12:20] YuviPanda: is there a bug for that? Might be good to keep a trail of what/why
[13:12:34] andrewbogott: yeah, moment
[13:12:51] (PS2) Yuvipanda: toollabs: Add gdal-bin to the exec environment [operations/puppet] - https://gerrit.wikimedia.org/r/132803 (https://bugzilla.wikimedia.org/65123)
[13:12:52] andrewbogott: done
[13:13:05] that was the only missing package I think
[13:13:34] YuviPanda: I think we're going to give the wikiatlas people their own project, so this is moot ftm
[13:13:59] andrewbogott: no I am sitting next to them right now and they don't need their own project.
[13:14:03] this was the only thing missing
[13:14:19] andrewbogott: plus once I explained that they will have to spend time administering it themselves and it is much easier with tools they were happy with tools
[13:14:24] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]
[13:14:26] let me comment on the bug as well
[13:14:27] ok then :)
[13:15:26] (CR) Andrew Bogott: [C: 2] toollabs: Add gdal-bin to the exec environment [operations/puppet] - https://gerrit.wikimedia.org/r/132803 (https://bugzilla.wikimedia.org/65123) (owner: Yuvipanda)
[13:16:05] andrewbogott: ty
[14:06:31] akosiaris: the vm is named trusty-test-puppetmaster
[14:21:56] !log power cycling asw-d5-eqiad
[14:22:03] Logged the message, Master
[14:26:24] RECOVERY - Host mw1208 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms
[14:26:24] RECOVERY - Host mw1201 is UP: PING OK - Packet loss = 0%, RTA = 1.73 ms
[14:26:24] RECOVERY - Host mw1209 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms
[14:26:24] RECOVERY - Host mw1203 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[14:26:34] RECOVERY - Host mw1210 is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms
[14:26:44] RECOVERY - check configured eth on lvs1002 is OK: NRPE: Unable to read output
[14:26:44] RECOVERY - Host mw1202 is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms
[14:26:44] RECOVERY - check configured eth on lvs1003 is OK: NRPE: Unable to read output
[14:27:04] RECOVERY - check configured eth on lvs1001 is OK: NRPE: Unable to read output
[14:33:14] PROBLEM - Puppet freshness on mw1203 is CRITICAL: Last successful Puppet run was Sat May 10 08:23:53 2014
[14:45:24] godog: here?
[14:54:14] RECOVERY - Puppet freshness on mw1203 is OK: puppet ran at Sun May 11 14:54:07 UTC 2014
[14:55:17] (CR) Andrew Bogott: [C: 2] Include the labs_initial_content role in labs_vagrant. [operations/puppet] - https://gerrit.wikimedia.org/r/132721 (owner: Andrew Bogott)
[15:07:29] YuviPanda: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Connecting_to_OSM_via_the_official_CLI_PostgreSQL
[15:14:08] akosiaris: did you have time to think of role::firewall ?
[15:14:28] paravoid: What does 'brief' mean?
[15:14:52] (in your e-mail to multimedia@lists.wm.org)
[15:16:17] Oh. Log says two minutes.
[15:17:29] twkozlowski: nice duration to report on the tech news for once ;)
[15:19:19] (PS1) Alexandros Kosiaris: Create a wikimaps_atlas postgis database [operations/puppet] - https://gerrit.wikimedia.org/r/132813 (https://bugzilla.wikimedia.org/63382)
[15:21:49] Nemo_bis: My thought exactly, but that'll have to wait till next week
[15:22:37] 'For two minutes on May 11, there were problems with image scaling due to a high server load.'
[15:23:34] !log reedy synchronized php-1.24wmf4/thumb.php
[15:23:40] Logged the message, Master
[15:24:37] twkozlowski: I think you can rewrite it to be more positive :P
[15:24:38] (CR) Alexandros Kosiaris: [C: 2] Create a wikimaps_atlas postgis database [operations/puppet] - https://gerrit.wikimedia.org/r/132813 (https://bugzilla.wikimedia.org/63382) (owner: Alexandros Kosiaris)
[15:25:57] 'For just two minutes on May 11, there were unnoticeable problems with scaling of a small number of files due to an excessively high server load.'
[15:26:12] !log reedy synchronized php-1.24wmf3/thumb.php
[15:26:20] Logged the message, Master
[15:30:59] (PS1) Alexandros Kosiaris: Fix a typo (planembad=>planemad) [operations/puppet] - https://gerrit.wikimedia.org/r/132814
[15:32:17] (CR) Alexandros Kosiaris: [C: 2 V: 2] Fix a typo (planembad=>planemad) [operations/puppet] - https://gerrit.wikimedia.org/r/132814 (owner: Alexandros Kosiaris)
[15:43:23] (CR) Ori.livneh: [C: 2] Tidy ::applicationserver & ::applicationserver::pybal_check [operations/puppet] - https://gerrit.wikimedia.org/r/132217 (owner: Ori.livneh)
[15:50:49] (CR) JanZerebecki: "There should be no problem with the possible slight increase in CPU load as the affected clusters aren't AFAIK utilized to that extend:" [operations/puppet] - https://gerrit.wikimedia.org/r/132393 (https://bugzilla.wikimedia.org/53259) (owner: JanZerebecki)
[15:57:44] PROBLEM - NTP on mw1208 is CRITICAL: NTP CRITICAL: Offset unknown
[16:03:44] RECOVERY - NTP on mw1208 is OK: NTP OK: Offset 0.005566000938 secs
[16:12:27] (CR) JanZerebecki: "In case the HSTS causes undesirable effects (more HTTPS users than could be expected) it can be reversed with max-age=0." [operations/puppet] - https://gerrit.wikimedia.org/r/132393 (https://bugzilla.wikimedia.org/53259) (owner: JanZerebecki)
[16:20:57] (CR) JanZerebecki: "This needs a files/ssl/dhparam.pem in the private repository generated by using: openssl dhparam 2048" [operations/puppet] - https://gerrit.wikimedia.org/r/132393 (https://bugzilla.wikimedia.org/53259) (owner: JanZerebecki)
[16:47:14] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Sun May 11 13:46:58 2014
[16:47:34] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Sun May 11 16:47:31 UTC 2014
[18:19:34] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[20:00:44] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[22:50:44] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds
[23:00:45] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds
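[Editor's note on the HSTS review comments at 16:12 and 16:20 on https://gerrit.wikimedia.org/r/132393: Strict-Transport-Security is just a response header sent over HTTPS, and the rollback described works because re-serving the header with max-age=0 tells browsers to forget the stored policy. The actual change is TLS/web-server configuration in operations/puppet (plus the dhparam.pem generated with the quoted `openssl dhparam 2048` command); the sketch below only illustrates the header mechanics as a small WSGI middleware, with an illustrative max-age rather than whatever the patch sets.]

```python
# Illustrative only: the max-age value and app names are made up, not taken
# from the Gerrit change.
HSTS_MAX_AGE = 31536000  # one year; serve 0 later to make browsers drop the policy

def add_hsts(app, max_age=HSTS_MAX_AGE):
    """Wrap a WSGI app so HTTPS responses carry Strict-Transport-Security."""
    def wrapped(environ, start_response):
        def sr(status, headers, exc_info=None):
            # The header is only meaningful when delivered over HTTPS.
            if environ.get("wsgi.url_scheme") == "https":
                headers = list(headers) + [
                    ("Strict-Transport-Security", "max-age=%d" % max_age)
                ]
            return start_response(status, headers, exc_info)
        return app(environ, sr)
    return wrapped

def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]

application = add_hsts(demo_app)
# add_hsts(demo_app, max_age=0) is the "reversed with max-age=0" rollback path.
```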