[00:12:32] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[00:52:22] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 348 seconds
[00:52:58] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 383 seconds
[00:53:48] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[00:54:09] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -1 seconds
[01:08:41] (PS1) Plucas: Make the metrics polling interval configurable [puppet/kafka] - https://gerrit.wikimedia.org/r/168528
[01:53:01] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%):
[02:05:09] RECOVERY - Disk space on ocg1001 is OK: DISK OK
[02:07:55] (PS1) Springle: ocg log fills up faster than daily cycle [puppet] - https://gerrit.wikimedia.org/r/168536
[02:20:02] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:23:49] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:24:09] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[02:36:41] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[02:41:30] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[02:46:15] !log LocalisationUpdate completed (1.25wmf4) at 2014-10-24 02:46:15+00:00
[02:46:26] Logged the message, Master
[03:01:20] PROBLEM - puppet last run on mw1083 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:04:10] PROBLEM - puppet last run on mw1031 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:19:49] RECOVERY - puppet last run on mw1083 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[03:22:36] !log LocalisationUpdate completed (1.25wmf5) at 2014-10-24 03:22:36+00:00
[03:22:40] RECOVERY - puppet last run on mw1031 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[03:22:43] Logged the message, Master
[04:20:40] (PS1) Ori.livneh: HHVM: report memory stats to Ganglia [puppet] - https://gerrit.wikimedia.org/r/168538
[04:24:08] (CR) Ori.livneh: [C: 2] HHVM: report memory stats to Ganglia [puppet] - https://gerrit.wikimedia.org/r/168538 (owner: Ori.livneh)
[05:04:34] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Oct 24 05:04:33 UTC 2014 (duration 4m 32s)
[05:04:42] Logged the message, Master
[06:29:19] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:38] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:09] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:39] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:46:18] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:46:19] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:51:01] (PS1) Gage: logstash filter: gelf: hadoop: update for logstash 1.4.2 [puppet] - https://gerrit.wikimedia.org/r/168548
[06:52:25] (CR) Gage: [C: 2] logstash filter: gelf: hadoop: update for logstash 1.4.2 [puppet] - https://gerrit.wikimedia.org/r/168548 (owner: Gage)
[07:26:25] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: puppet fail
[07:36:26] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[07:42:35] (PS1) Alexandros Kosiaris: osm.py: Remove some debug statements [puppet] - https://gerrit.wikimedia.org/r/168555
[07:44:46] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[07:46:00] (PS2) Alexandros Kosiaris: osm.py: Remove some debug statements [puppet] - https://gerrit.wikimedia.org/r/168555
[07:50:06] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[07:56:24] (CR) Alexandros Kosiaris: [C: 2] osm.py: Remove some debug statements [puppet] - https://gerrit.wikimedia.org/r/168555 (owner: Alexandros Kosiaris)
[08:04:59] (CR) Giuseppe Lavagetto: [C: 1] "one nitpick but LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/168006 (owner: Dzahn)
[08:11:58] (PS6) Giuseppe Lavagetto: sudo: create module, remove old files [puppet] - https://gerrit.wikimedia.org/r/167183
[08:18:23] _joe_: you might be able to try that patch on beta cluster since it uses sudo_user to set up rights for mwdeploy
[08:18:53] <_joe_> hashar: yeah but I found one error I made... mh
[08:25:44] (PS7) Giuseppe Lavagetto: sudo: create module, remove old files [puppet] - https://gerrit.wikimedia.org/r/167183
[08:32:26] (CR) Giuseppe Lavagetto: [C: 2] "http://puppet-compiler.wmflabs.org/451/change/167183/html" [puppet] - https://gerrit.wikimedia.org/r/167183 (owner: Giuseppe Lavagetto)
[08:34:18] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[08:36:28] PROBLEM - puppet last run on mw1027 is CRITICAL: CRITICAL: puppet fail
[08:36:28] PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: puppet fail
[08:36:39] PROBLEM - puppet last run on analytics1018 is CRITICAL: CRITICAL: puppet fail
[08:36:49] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: puppet fail
[08:38:02] <_joe_> mmmh
[08:38:50] <_joe_> that's on me I guess
[08:40:32] <_joe_> very strange error indeed
[08:42:49] RECOVERY - puppet last run on mw1027 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[08:54:58] RECOVERY - puppet last run on ms-be1002 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[08:55:18] RECOVERY - puppet last run on analytics1018 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[08:55:29] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[08:56:08] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: puppet fail
[09:12:04] (PS2) Giuseppe Lavagetto: role::labs::instance: include sudo::labs_project [puppet] - https://gerrit.wikimedia.org/r/168062
[09:17:43] (PS2) Nemo bis: Disable l10nupdate for the duration of CLDR 26 plural migration [puppet] - https://gerrit.wikimedia.org/r/168255 (https://bugzilla.wikimedia.org/62861) (owner: Nikerabbit)
[09:22:43] (PS1) Alexandros Kosiaris: Various ganglia::web fixes [puppet] - https://gerrit.wikimedia.org/r/168559
[09:24:49] (CR) Alexandros Kosiaris: [C: 2] Various ganglia::web fixes [puppet] - https://gerrit.wikimedia.org/r/168559 (owner: Alexandros Kosiaris)
[09:27:47] (PS1) Giuseppe Lavagetto: compare-puppet-catalogs: fix hiera copy [software] - https://gerrit.wikimedia.org/r/168560
[09:28:20] (CR) Giuseppe Lavagetto: [C: 2] compare-puppet-catalogs: fix hiera copy [software] - https://gerrit.wikimedia.org/r/168560 (owner: Giuseppe Lavagetto)
[10:34:41] What's links.email.donate.wikimedia.org and why does it allegedly not support https? https://github.com/EFForg/https-everywhere/issues/686
[10:35:33] This server could not prove that it is links.email.donate.wikimedia.org; its security certificate is from *.links.mkt41.net. This may be caused by a misconfiguration or an attacker intercepting your connection.
[10:37:28] * Reedy replies
[10:37:50] Nemo_bis: I guess it's 2 fold. One it's not a WMF site, and 2 it's a multiple subdomain so wouldn't have an SSL cert
[10:49:01] Reedy: he probably got the link from the email donation campaigns
[10:49:49] In which case it's IMHO a WMF bug that links are sent which don't support HTTPS
[10:51:54] indeed
[10:52:02] Well, I guess it partially depends..
[10:52:10] Does the email give a HTTP link?
[10:52:22] cause I presume HTTPSE is rewriting it
[10:53:31] http link, yes, so I understood
[10:55:34] (PS1) Alexandros Kosiaris: Clean up ganglia::web config [puppet] - https://gerrit.wikimedia.org/r/168570
[10:58:21] (CR) Alexandros Kosiaris: [C: 2] Clean up ganglia::web config [puppet] - https://gerrit.wikimedia.org/r/168570 (owner: Alexandros Kosiaris)
[11:11:13] (PS1) Giuseppe Lavagetto: compare-puppet-catalogs: specifiy hiera_config [software] - https://gerrit.wikimedia.org/r/168571
[11:11:45] (CR) Giuseppe Lavagetto: [C: 2] compare-puppet-catalogs: specifiy hiera_config [software] - https://gerrit.wikimedia.org/r/168571 (owner: Giuseppe Lavagetto)
[11:44:57] wtf job runners
[12:01:25] (PS1) Alexandros Kosiaris: ganglia_aggregators for sca, openldap_corp_mirror [puppet] - https://gerrit.wikimedia.org/r/168577
[12:10:16] !log restarted gmetad on nickel, it was not responding on port 8654
[12:10:23] Logged the message, Master
[12:10:32] (CR) Alexandros Kosiaris: [C: 2] ganglia_aggregators for sca, openldap_corp_mirror [puppet] - https://gerrit.wikimedia.org/r/168577 (owner: Alexandros Kosiaris)
[12:13:07] (PS2) Mforns: Add centralauth to puppet db_config.yaml [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/167821
[12:15:00] (CR) Hashar: [C: 1] contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - https://gerrit.wikimedia.org/r/163791 (owner: Krinkle)
[12:37:45] (PS1) Alexandros Kosiaris: Sync up gmetad clusters [puppet] - https://gerrit.wikimedia.org/r/168578
[12:41:23] (CR) Alexandros Kosiaris: [C: 2] Sync up gmetad clusters [puppet] - https://gerrit.wikimedia.org/r/168578 (owner: Alexandros Kosiaris)
[13:21:44] (PS1) Alexandros Kosiaris: Set state file path for osm ganglia plugin [puppet] - https://gerrit.wikimedia.org/r/168581
[13:23:43] (PS2) Alexandros Kosiaris: make hooft a real 'bastionhost' [puppet] - https://gerrit.wikimedia.org/r/168124 (owner: Dzahn)
[13:23:50] (CR) Alexandros Kosiaris: [C: 2] make hooft a real 'bastionhost' [puppet] - https://gerrit.wikimedia.org/r/168124 (owner: Dzahn)
[13:26:17] (PS2) Alexandros Kosiaris: Set state file path for osm ganglia plugin [puppet] - https://gerrit.wikimedia.org/r/168581
[13:29:06] (CR) Alexandros Kosiaris: [C: 2] Set state file path for osm ganglia plugin [puppet] - https://gerrit.wikimedia.org/r/168581 (owner: Alexandros Kosiaris)
[13:30:48] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Puppet last ran 15456 seconds ago, expected 14400
[13:31:47] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[13:36:07] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.008 second response time
[13:38:06] (PS3) Alexandros Kosiaris: Add a ferm service for ssh on all bastionhosts [puppet] - https://gerrit.wikimedia.org/r/164542
[13:39:15] !log disabled puppet on uranium. Testing ganglia with SSDs
[13:39:23] Logged the message, Master
[13:48:54] http://status.wikimedia.org/ is saying DNS is slow, I'm hearing reports that Commons has been slow for about a week
[13:49:01] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.025 second response time
[13:49:20] (CR) Ottomata: [C: 2 V: 2] Make the metrics polling interval configurable [puppet/kafka] - https://gerrit.wikimedia.org/r/168528 (owner: Plucas)
[13:49:35] Sorry, every site
[13:54:39] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[13:59:59] (CR) Ottomata: Add centralauth to puppet db_config.yaml (1 comment) [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/167821 (owner: Mforns)
[14:01:19] (CR) Alexandros Kosiaris: [C: 2] Add a ferm service for ssh on all bastionhosts [puppet] - https://gerrit.wikimedia.org/r/164542 (owner: Alexandros Kosiaris)
[14:01:54] (PS1) Cmjohnson: Fixing typo on wmnet file [dns] - https://gerrit.wikimedia.org/r/168584
[14:02:22] (CR) Cmjohnson: [C: 2] Fixing typo on wmnet file [dns] - https://gerrit.wikimedia.org/r/168584 (owner: Cmjohnson)
[14:04:33] (PS1) Filippo Giunchedi: handle file missing in projectgid.rb [puppet] - https://gerrit.wikimedia.org/r/168585
[14:05:09] anyone up for an easy one? ^
[14:05:40] <_joe_> I am
[14:06:01] <_joe_> I've been trying without success some major purge for the last hour or so
[14:06:44] that sounds horrific
[14:06:47] (CR) Giuseppe Lavagetto: [C: 1] "This is true also for production. Thanks for fixing it!" [puppet] - https://gerrit.wikimedia.org/r/168585 (owner: Filippo Giunchedi)
[14:07:40] <_joe_> godog: I'm trying to get rid of the evil parts of webserver.pp
[14:08:18] (PS2) Filippo Giunchedi: handle file missing in projectgid.rb [puppet] - https://gerrit.wikimedia.org/r/168585
[14:08:33] (CR) Filippo Giunchedi: [C: 2] handle file missing in projectgid.rb [puppet] - https://gerrit.wikimedia.org/r/168585 (owner: Filippo Giunchedi)
[14:08:42] (CR) Filippo Giunchedi: [V: 2] handle file missing in projectgid.rb [puppet] - https://gerrit.wikimedia.org/r/168585 (owner: Filippo Giunchedi)
[14:08:44] nice
[14:16:28] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: puppet fail
[14:23:17] (PS1) Alexandros Kosiaris: Minor changes in bastionhost ferm rules [puppet] - https://gerrit.wikimedia.org/r/168589
[14:25:08] (CR) Alexandros Kosiaris: [C: 2] Minor changes in bastionhost ferm rules [puppet] - https://gerrit.wikimedia.org/r/168589 (owner: Alexandros Kosiaris)
[14:26:36] _joe_: once https://gerrit.wikimedia.org/r/#/c/168062/2 is merged, that class will have to be removed from the ldap node definition for every instance.
[14:26:41] I can merge and do that now if you like.
[14:27:23] <_joe_> andrewbogott: I was waiting for you and coren to chime in
[14:27:25] <_joe_> :)
[14:27:31] <_joe_> so yeah if you feel like it
[14:27:57] <_joe_> andrewbogott: https://gerrit.wikimedia.org/r/#/c/168067/ is the companion of that
[14:28:06] (CR) Andrew Bogott: [C: 2] role::labs::instance: include sudo::labs_project [puppet] - https://gerrit.wikimedia.org/r/168062 (owner: Giuseppe Lavagetto)
[14:28:11] It makes sense to me
[14:28:33] (CR) Andrew Bogott: [C: 1] wikitech: do not include sudoers::labs_project via ldap [mediawiki-config] - https://gerrit.wikimedia.org/r/168067 (owner: Giuseppe Lavagetto)
[14:28:37] It was only separate because it existed before role::labs::instance did.
[14:28:42] _joe_: lemme add that to the swap calendar
[14:28:49] um… swat
[14:29:32] andrewbogott: did you see https://gerrit.wikimedia.org/r/#/c/168269/?
[14:29:34] uhoh, is there now swat on Fridays?
[14:29:50] um… no swat?
[14:30:13] <_joe_> andrewbogott: we usually use sync-file for single-file changes
[14:30:22] andrewbogott: there's usually no swat on friday ya
[14:30:23] (CR) Andrew Bogott: [C: 2] wikitech: do not include sudoers::labs_project via ldap [mediawiki-config] - https://gerrit.wikimedia.org/r/168067 (owner: Giuseppe Lavagetto)
[14:30:29] ok
[14:30:34] (Merged) jenkins-bot: wikitech: do not include sudoers::labs_project via ldap [mediawiki-config] - https://gerrit.wikimedia.org/r/168067 (owner: Giuseppe Lavagetto)
[14:32:01] _joe_: yeah, but that's supposed to happen during the SWAT window generally. I think since this doesn't touch other wikis it's fine to do outside of a window though
[14:33:01] <_joe_> andrewbogott: yeah this is an ops matter I guess
[14:33:15] <_joe_> like when sean pushes changes to dbs :)
[14:35:00] !log andrew Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 02s)
[14:35:11] Logged the message, Master
[14:35:53] <_joe_> am I the only one that finds the fact that webserver::static installs lighty and webserver::php5 installs apache mildly disturbing?
[14:36:06] um… I've definitely done this before, but right now I'm getting a ton of Permission denied (publickey)
[14:36:34] !log andrew Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 03s)
[14:36:39] hm, there we go
[14:36:41] Logged the message, Master
[14:37:12] !log running sync-common on virt1000
[14:37:17] Logged the message, Master
[14:38:43] YuviPanda: I've seen a thousand gerrit emails about that patch so I figure you're still working on it frantically. I haven't read it or thought about it much so far.
[14:39:03] andrewbogott: it's done for a bit now, just bikeshedding now
[14:39:09] ok
[14:40:56] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[14:43:16] Reedy: is ^ my fault? I synced the file that I changed, just a minute ago.
[14:43:28] :P
[14:44:00] it's always Reedy's fault
[14:44:10] Oh, great
[14:47:53] (PS3) Mforns: Add centralauth to puppet db_config.yaml [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/167821
[14:55:03] (CR) Alexandros Kosiaris: [C: 2] Ignore .gitreview when building source [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/167756 (owner: Alexandros Kosiaris)
[14:56:20] _joe_: ok, I removed that class from all ldap records.
[14:56:40] <_joe_> andrewbogott: let's hope this works :)
[14:56:52] seems to. I'll make a new instance to verify
[14:57:20] !log francium going offline, ignore any icinga warning
[14:57:26] Logged the message, Master
[14:57:41] hmmm
[14:57:57] robh: schedule maint via icinga ?
[14:58:10] and as a side note, it would be cool to do it via that bot :-)
[14:59:14] akosiaris: i dont see it in icinga
[14:59:21] the log was more of a 'if i missed it somehow' thing
[14:59:29] its actually not deployed afaict
[15:01:49] (PS1) RobH: reclaiming server francium to spares [puppet] - https://gerrit.wikimedia.org/r/168595
[15:02:54] (PS1) RobH: reclaim francium to spares [dns] - https://gerrit.wikimedia.org/r/168596
[15:03:08] robh: a ok, sorry then :-/
[15:03:25] (PS5) Alexandros Kosiaris: let bastion hosts have base::firewall [puppet] - https://gerrit.wikimedia.org/r/96424 (owner: Dzahn)
[15:04:34] akosiaris: no worries, its a legit request
[15:04:43] i should have been more verbose in my admin log ;]
[15:05:02] cuz alerts when they dont need to happen are a major issue.
[15:05:23] (PS1) Cmjohnson: Adding netboot and dhcpd for elastic1020-1031 [puppet] - https://gerrit.wikimedia.org/r/168597
[15:05:53] (CR) RobH: [C: 2] reclaiming server francium to spares [puppet] - https://gerrit.wikimedia.org/r/168595 (owner: RobH)
[15:06:14] (CR) RobH: [C: 2] reclaim francium to spares [dns] - https://gerrit.wikimedia.org/r/168596 (owner: RobH)
[15:06:17] ottomata: ^^
[15:06:30] (CR) Alexandros Kosiaris: [C: -1] "My minus one is on the basis that this needs some careful coordination as I underlined above to avoid any weird issues and us getting lock" [puppet] - https://gerrit.wikimedia.org/r/96424 (owner: Dzahn)
[15:07:06] cmjohnson: Do you have any spare EX4500s or EX4550s @ eqiad?
[15:07:08] cooooOOl
[15:07:31] (CR) Ottomata: [C: 2] Adding netboot and dhcpd for elastic1020-1031 [puppet] - https://gerrit.wikimedia.org/r/168597 (owner: Cmjohnson)
[15:07:39] will let you merge cmjohnson
[15:09:48] _joe_: everything looks good. thanks for the cleanup
[15:10:02] <_joe_> andrewbogott: np
[15:24:56] PROBLEM - puppet last run on mw1019 is CRITICAL: CRITICAL: Puppet has 1 failures
[15:26:50] Reedy: I just sent you an email with a plan from Nikerabbit to do some l10n switcheroo (CLDR 26) next week.
[15:33:46] (Draft1) Filippo Giunchedi: import debian/ directory [debs/python-diamond] - https://gerrit.wikimedia.org/r/168599
[15:36:09] (PS1) ArielGlenn: script to clean up salt keys of deleted labs instances [puppet] - https://gerrit.wikimedia.org/r/168601
[15:36:54] (CR) jenkins-bot: [V: -1] script to clean up salt keys of deleted labs instances [puppet] - https://gerrit.wikimedia.org/r/168601 (owner: ArielGlenn)
[15:38:55] RECOVERY - puppet last run on mw1019 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[15:44:26] (PS2) ArielGlenn: script to clean up salt keys of deleted labs instances [puppet] - https://gerrit.wikimedia.org/r/168601
[15:47:06] gwicke: i forget, can you show me your xmldump -> cassandra parser thing again?
[15:47:08] link please?
[15:54:13] (PS1) Giuseppe Lavagetto: webserver: move to a module, fix and remove a few things [puppet] - https://gerrit.wikimedia.org/r/168604
[15:56:45] (CR) Andrew Bogott: [C: 1] "Oh, I was about to say that this is too harsh because it might purge instances that are known to nova, but now I see that you're double-ch" [puppet] - https://gerrit.wikimedia.org/r/168601 (owner: ArielGlenn)
[15:58:31] ottomata: https://github.com/gwicke/restbase-cassandra/tree/master/test/dump
[15:58:50] danke
[15:58:55] js, ah right
[15:58:55] k
[15:59:05] cool
[16:02:38] (CR) Nuria: Add centralauth to puppet db_config.yaml (2 comments) [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/167821 (owner: Mforns)
[16:30:27] PROBLEM - Disk space on virt1006 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%):
[16:34:31] (PS1) Chad: Phabricator: repository.default-local-path to proper location [puppet] - https://gerrit.wikimedia.org/r/168611
[16:37:10] (CR) Rush: [C: 1] "should be good" [puppet] - https://gerrit.wikimedia.org/r/168611 (owner: Chad)
[16:44:23] PROBLEM - puppet last run on mw1019 is CRITICAL: CRITICAL: Puppet has 1 failures
[16:44:44] (PS2) Chad: Phabricator: repository.default-local-path to proper location [puppet] - https://gerrit.wikimedia.org/r/168611
[16:45:12] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail
[16:46:03] RECOVERY - Disk space on virt1006 is OK: DISK OK
[16:50:12] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 137 seconds ago with 0 failures
[16:53:30] _joe_: hey
[16:58:43] RECOVERY - puppet last run on mw1019 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[17:02:13] ori: Hey, can you refresh my mind regarding https://gerrit.wikimedia.org/r/#/c/145997/3 ?
[17:02:19] I'm concerned
[17:02:38] It was a fix for something that caused a fair amount of users to get http bad gateway errors
[17:02:45] but we never merged it?
[17:03:12] it wasn't needed
[17:05:40] <_joe_> ori: hi
[17:07:57] !log getting ready to replace a failed disk on ganglia (server:nickel)...it will be offline for a few minutes
[17:08:05] Logged the message, Master
[17:10:22] <_joe_> ori: I guess most hhvm issues we got reported came from memory exhaustion; I rolling restarted all hhvm appservers on wednesday, we may need to do that again during this weekend or on monday
[17:10:32] RECOVERY - RAID on nickel is OK: OK: Active: 1, Working: 1, Failed: 0, Spare: 0
[17:12:23] PROBLEM - Host nickel is DOWN: PING CRITICAL - Packet loss = 100%
[17:17:33] RECOVERY - Host nickel is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms
[17:19:54] PROBLEM - SSH on nickel is CRITICAL: Connection refused
[17:19:54] PROBLEM - Disk space on nickel is CRITICAL: Connection refused by host
[17:19:54] PROBLEM - check if dhclient is running on nickel is CRITICAL: Connection refused by host
[17:19:54] PROBLEM - puppet last run on nickel is CRITICAL: Connection refused by host
[17:20:05] PROBLEM - check if salt-minion is running on nickel is CRITICAL: Connection refused by host
[17:20:05] PROBLEM - check configured eth on nickel is CRITICAL: Connection refused by host
[17:20:22] PROBLEM - HTTP on nickel is CRITICAL: Connection timed out
[17:20:42] PROBLEM - RAID on nickel is CRITICAL: Timeout while attempting connection
[17:20:47] Anyone working on poor nickel?
[17:20:48] PROBLEM - DPKG on nickel is CRITICAL: Timeout while attempting connection
[17:23:31] Coren: in the backscroll, cmjohnson says he's replacing a disk on nickel.
[17:24:32] PROBLEM - Host nickel is DOWN: CRITICAL - Plugin timed out after 15 seconds
[17:29:07] (CR) Ricordisamoa: [C: -1] "Many minor changes are already in review as I0fad583a66e71e02e9a38b359e60d238167825ef." [mediawiki-config] - https://gerrit.wikimedia.org/r/168322 (https://bugzilla.wikimedia.org/72422) (owner: Glaisher)
[17:32:06] (CR) Glaisher: "Is that a reason to -1 this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/168322 (https://bugzilla.wikimedia.org/72422) (owner: Glaisher)
[17:35:25] (PS1) 01tonythomas: Make BounceHandler extension work on en-wiki [puppet] - https://gerrit.wikimedia.org/r/168622
[17:37:14] RECOVERY - SSH on nickel is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7.1 (protocol 2.0)
[17:37:23] RECOVERY - Host nickel is UP: PING OK - Packet loss = 0%, RTA = 1.65 ms
[17:37:49] (CR) Ricordisamoa: "It is generally not wise to introduce minor changes that duplicate an existing changeset." [mediawiki-config] - https://gerrit.wikimedia.org/r/168322 (https://bugzilla.wikimedia.org/72422) (owner: Glaisher)
[17:39:23] RECOVERY - check configured eth on nickel is OK: NRPE: Unable to read output
[17:39:42] RECOVERY - DPKG on nickel is OK: All packages OK
[17:39:43] RECOVERY - RAID on nickel is OK: OK: Active: 1, Working: 1, Failed: 0, Spare: 0
[17:44:21] (CR) Glaisher: minor changes to InitialiseSettings.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/129464 (owner: Ricordisamoa)
[17:44:41] (PS22) Krinkle: contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - https://gerrit.wikimedia.org/r/163791
[17:46:10] cmjohnson: are you done?
[17:47:06] [ 184.175555] EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 676513
[17:47:09] [ 184.184438] Aborting journal on device dm-0-8.
[17:47:12] [ 184.189128] EXT4-fs error (device dm-0): ext4_journal_start_sb: Detected aborted journal
[17:47:15] [ 184.200031] EXT4-fs (dm-0): Remounting filesystem read-only
[17:47:17] fun
[17:47:22] why did we do this on friday again? :)
[17:48:17] paravoid: after the new disk..it didn't boot and to avoid longer delays i put the old disk back in and now we're getting these ext4-fs errors
[17:48:21] (CR) Ori.livneh: [C: 2] contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - https://gerrit.wikimedia.org/r/163791 (owner: Krinkle)
[17:49:06] i assume the original issue is with grub...although I verified that grub was on /dev/sdb before doing anything
[17:49:10] rebooting it
[17:50:23] I wonder if Alex's rsync finished
[17:50:33] PROBLEM - Host mw1041 is DOWN: PING CRITICAL - Packet loss = 100%
[17:50:54] Hm.. how should I puppetise when there can be multiple instances of something but they depend on something common, where do I define that common resource? E.g. for the testing "localhost" apache we use contint::localvhost resources, and they get specified docroot like /srv/localhost/, we currently repeat the resource for /srv/localhost in three places.
[17:51:00] That conflicts when there is more than one on one node
[17:51:02] PROBLEM - Host nickel is DOWN: PING CRITICAL - Packet loss = 100%
[17:51:33] Where can I define a resource (e.g. a File) but only once, regardless of how many times a resource of that type is declared?
[17:52:22] Krinkle: if there is a unifying commonality, it's better to abstract it out and have each place include it
[17:52:32] RECOVERY - Host nickel is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[17:53:03] Krinkle: in a pinch you can do ensure_resources()
[17:53:22] RECOVERY - puppet last run on nickel is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[17:53:24] RECOVERY - check if salt-minion is running on nickel is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[17:53:36] second disk broken too
[17:53:41] [ 108.072953] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[17:53:44] [ 108.079384] ata1.00: BMDMA stat 0x24
[17:53:46] ori: So in this case we have 1) https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/ci.pp#L384-L396 and 2) https://github.com/wikimedia/operations-puppet/blob/production/modules/contint/manifests/qunit_localhost.pp
[17:53:47] [ 108.082949] ata1.00: failed command: READ DMA
[17:53:50] both using https://github.com/wikimedia/operations-puppet/blob/production/modules/contint/manifests/localvhost.pp
[17:53:57] but mdstat is very weird
[17:54:04] I need both on the same node soon, so I need a place to put the /srv/localhost
[17:54:07] are we sure we didn't boot from the wrong disk?
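The "abstract it out and have each place include it" advice above can be sketched as follows. This is a minimal illustration under assumptions, not the actual contint module: the names example::localhost_root and example::localvhost and the owner/group/mode values are hypothetical. It relies on the fact that Puppet classes are singletons, so a defined type that is declared several times on one node can include the same class each time without a duplicate declaration.

```puppet
# Hypothetical names throughout; this is not the real contint layout.

# A class is evaluated once per node no matter how often it is included,
# so the shared directory is declared in exactly one place.
class example::localhost_root {
    file { '/srv/localhost':
        ensure => directory,
        owner  => 'root',
        group  => 'root',
        mode   => '0755',
    }
}

# A defined type may be declared many times on the same node, so it must
# not declare the shared directory itself; it includes the class instead.
define example::localvhost ($docroot) {
    include example::localhost_root

    file { $docroot:
        ensure  => directory,
        require => Class['example::localhost_root'],
    }
}

# Two instances on one node: /srv/localhost is still declared only once.
example::localvhost { 'qunit':
    docroot => '/srv/localhost/qunit',
}
example::localvhost { 'mediawiki':
    docroot => '/srv/localhost/mediawiki',
}
```

The "in a pinch" alternative is a guard function such as puppetlabs-stdlib's ensure_resource('file', '/srv/localhost', { 'ensure' => 'directory' }), which only declares the resource if it is not already in the catalog. Per-environment owner and group could be passed as class parameters rather than hardcoded.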
[17:54:10] Krinkle: "/srv/localhost" is pretty weird
[17:54:32] i switched bios back to boot from port A
[17:54:35] ori: it's like /srv/org/wikimedia/foo we have /srv/localhost/{qunit,mediawiki}
[17:54:43] but for ports instead of subdomains
[17:54:51] cmjohnson: yeah, you've booted from the broken disk
[17:55:10] ori: /var/www seemed reserved and too generic to call dibs on. Happy to put it elsewhere.
[17:55:17] cmjohnson: poweroff and remove that disk
[17:55:33] I put the broken disk back in because I couldn't get nickel to boot with the new disk
[17:55:36] Krinkle: taking a look, sec
[17:55:41] (CR) Dzahn: "oops, broke puppet run on bast1001:" [puppet] - https://gerrit.wikimedia.org/r/167885 (owner: Dzahn)
[17:56:07] cmjohnson: the RAID between the old disks is broken so they act as two independent disks
[17:56:26] cmjohnson: and the broken one has stale data, back from September 2nd
[17:56:32] cmjohnson: so now we've booted with that
[17:56:40] cmjohnson: unplug that disk and reboot
[17:57:05] Krinkle: in this case it seems like hashar was lazy about fixing a group issue
[17:57:22] andrewbogott: would it be possible to add a 'jenkins-deploy' group in labs?
[17:57:49] cmjohnson: hm wait a sec
[17:58:12] ori: you mean in ldap? It's easy enough, let me make sure there isn't one already...
[17:58:16] ok
[17:58:22] andrewbogott: thanks
[17:58:23] !log stat1001 - Duplicate declaration: Package[nodejs]
[17:58:29] Logged the message, Master
[17:58:48] paravoid: i just need to be able to boot from /dev/sdb which I couldn't get to earlier
[17:59:03] ottomata: there's a conflict between statistics.pp and limn module on stat1001
[17:59:12] cmjohnson: yeah, fixed
[17:59:14] mutante: let me take a look for a sec
[17:59:22] cmjohnson: shutdown and unplug that broken disk
[17:59:23] ori: cool,thx
[18:00:17] paravoid: details? what did i miss?
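The "Duplicate declaration: Package[nodejs]" failure logged above is what Puppet reports when two classes applied to the same node each declare the same package resource. A minimal sketch of the problem and one common guard follows. The class names are hypothetical, and the fix shown uses ensure_packages() from puppetlabs-stdlib as a stand-in; it is not necessarily what the actual change did.

```puppet
# Hypothetical classes standing in for the two conflicting manifests.

# Problem: both classes declare Package['nodejs'] directly. Applying both
# to one node aborts catalog compilation with a duplicate declaration.
class example::stats {
    package { 'nodejs':
        ensure => present,
    }
}
class example::dashboards {
    package { 'nodejs':
        ensure => present,
    }
}

# One guard (assumes puppetlabs-stdlib is available): ensure_packages()
# declares the package only if it is not already in the catalog, so both
# classes can be applied to the same node.
class example::stats_fixed {
    ensure_packages(['nodejs'])
}
class example::dashboards_fixed {
    ensure_packages(['nodejs'])
}
```

The guard only helps if every declaration of the package goes through it; a single remaining plain package { 'nodejs': } elsewhere can still conflict.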
[18:00:28] (03PS1) 10Ori.livneh: misc::statistics: use require_package('nodejs'), as we do elsewhere [puppet] - 10https://gerrit.wikimedia.org/r/168625 [18:00:33] mutante: ^ [18:00:41] cmjohnson: I did a "grub-install /dev/sdb" [18:01:30] PROBLEM - Host nickel is DOWN: PING CRITICAL - Packet loss = 100% [18:01:38] ori: Hm.. what do you mean? [18:02:13] (03CR) 10Dzahn: [C: 031] "yes, thank you. that should fix the duplicate declaration on stat1001" [puppet] - 10https://gerrit.wikimedia.org/r/168625 (owner: 10Ori.livneh) [18:02:16] ori: Should the user be in puppet? [18:02:18] ori: there's already a mwdeploy group; this would be the same as that? [18:02:22] Can you explain what it's for? [18:02:22] ori: want me to merge ? [18:02:26] and it depend on that (via a parameter I guess) [18:02:27] sure [18:02:45] (that was @mutante) [18:03:01] (03CR) 10Dzahn: [C: 032] "Duplicate declaration: Package[nodejs]" [puppet] - 10https://gerrit.wikimedia.org/r/168625 (owner: 10Ori.livneh) [18:03:27] andrewbogott: it's so we can avoid having ugly workarounds like /srv/localhost declared separately in role/ci.pp with comment " group => 'root', # no jenkins-deploy group in labs " [18:05:12] ori: Hm.. but it seems they also differ as jenkins-deploy and jenkins-slave between the two uses [18:05:21] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [18:05:30] interesting that the owner is fine but not the group? [18:05:37] (03CR) 10Dzahn: "yep. fixed. RECOVERY - puppet last run on stat1001 is OK" [puppet] - 10https://gerrit.wikimedia.org/r/168625 (owner: 10Ori.livneh) [18:05:47] ugh [18:05:57] Krinkle: might want to take it up with hashar [18:06:29] (03PS4) 10Mforns: Add centralauth to puppet db_config.yaml [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/167821 [18:06:33] ori: let's say for now the same users exist and that's not an issue. 
What about the resource in general [18:06:47] Krinkle: the resource would stay in the module and move out of the role [18:07:17] I don't think the solution should have the user/group hardcoded in the definition, it should be fine to pass different values in prod and labs. [18:07:44] as long as I can use it in multiple labs roles that are applies to the same role/node [18:07:52] so move it out of the module and have a separate ::foo::production / ::foo::labs roles [18:07:55] or use hiera [18:07:59] but i gotta run, sorry [18:08:02] noam's up [18:08:33] ori, Krinkle, I'm happy to create groups -- make me a bug once you figure out what you need. [18:08:53] I don't know about the groups, it's a strange setup hashar made. Don't create anything yet. [18:09:08] I can work around that for now, my problem is with something else. [18:09:13] I'll get back if I need anything, thanks. [18:09:24] (03PS1) 10BBlack: allow zero-length const_string_add() [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/168626 [18:09:48] (03CR) 10BBlack: [C: 032 V: 032] allow zero-length const_string_add() [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/168626 (owner: 10BBlack) [18:10:00] paravoid: i get a grub rescue prompt [18:10:12] grmbl [18:10:29] (03CR) 10Dzahn: move 'noc' from misc to module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/168006 (owner: 10Dzahn) [18:15:20] cmjohnson: can I grab the console and poke around a bit? [18:15:26] sure [18:15:33] all yours [18:16:00] bblack ..i couldn't find a /boot [18:16:16] ok [18:16:37] I just want to catch up to wherever paravoid was at and confirm. then we'll probably end up booting the bad disk again and try to re-fix the good one. [18:16:47] (03PS1) 10Kaldari: Adding WikiGrok to extensions list for testing on Beta Labs, etc. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/168628 (https://bugzilla.wikimedia.org/72465) [18:17:33] (03CR) 10Kaldari: [C: 032] Adding WikiGrok to extensions list for testing on Beta Labs, etc. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168628 (https://bugzilla.wikimedia.org/72465) (owner: 10Kaldari) [18:17:40] (03Merged) 10jenkins-bot: Adding WikiGrok to extensions list for testing on Beta Labs, etc. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168628 (https://bugzilla.wikimedia.org/72465) (owner: 10Kaldari) [18:19:16] (03PS1) 10Krinkle: contint: Minor clean up [puppet] - 10https://gerrit.wikimedia.org/r/168629 [18:19:18] (03PS1) 10Krinkle: contint: Move /srv/localhost/qunit resource out of qunit_localhost class [puppet] - 10https://gerrit.wikimedia.org/r/168630 [18:19:20] (03PS1) 10Krinkle: [WIP] contint: Apply contint::qunit_localhost to labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/168631 [18:20:00] (03CR) 10jenkins-bot: [V: 04-1] contint: Move /srv/localhost/qunit resource out of qunit_localhost class [puppet] - 10https://gerrit.wikimedia.org/r/168630 (owner: 10Krinkle) [18:20:22] (03CR) 10jenkins-bot: [V: 04-1] [WIP] contint: Apply contint::qunit_localhost to labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/168631 (owner: 10Krinkle) [18:20:27] cmjohnson: is the bad disk still out, or back in? [18:20:38] it's still out [18:20:59] (03PS2) 10Krinkle: contint: Move /srv/localhost/qunit resource out of qunit_localhost class [puppet] - 10https://gerrit.wikimedia.org/r/168630 [18:21:05] (03PS2) 10Krinkle: [WIP] contint: Apply contint::qunit_localhost to labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/168631 [18:21:24] cmjohnson: bios shows 2x 500GB disks on sata ports A + B [18:21:45] correct...there is a new disk in port A [18:21:50] ah! [18:21:53] (03PS1) 10MaxSem: Revert "Adding WikiGrok to extensions list for testing on Beta Labs, etc." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/168633 [18:22:02] (03CR) 10MaxSem: [C: 032] Revert "Adding WikiGrok to extensions list for testing on Beta Labs, etc." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168633 (owner: 10MaxSem) [18:22:12] (03Merged) 10jenkins-bot: Revert "Adding WikiGrok to extensions list for testing on Beta Labs, etc." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168633 (owner: 10MaxSem) [18:22:46] can we try booting with just the old unfailed disk in port B, and leaving A unplugged for now? [18:23:02] sure [18:23:07] ...give me a sec [18:23:09] (is this stuff hotpluggable at the hw level btw?) [18:23:20] nope...internal [18:23:43] can you shutdown [18:23:48] oh yeah [18:24:06] (03CR) 10Nuria: [C: 031] "Tested on vagrant. Please Andrew check that these changes are likely to work in staging/prod." [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/167821 (owner: 10Mforns) [18:24:09] well I disconnected anyways, it's back on a grub rescue prompt [18:25:53] bblack...booting [18:28:00] (03PS4) 10Dzahn: move 'noc' from misc to module [puppet] - 10https://gerrit.wikimedia.org/r/168006 [18:29:22] (03PS2) 1001tonythomas: Make BounceHandler extension work on en-wiki [puppet] - 10https://gerrit.wikimedia.org/r/168622 [18:30:03] (03CR) 10jenkins-bot: [V: 04-1] Make BounceHandler extension work on en-wiki [puppet] - 10https://gerrit.wikimedia.org/r/168622 (owner: 1001tonythomas) [18:30:09] cmjohnson: I'm getting nothing on serial cons, is anything happening there? [18:30:39] oh there we go, grubrescue again [18:30:43] had to hit f1 [18:31:05] hmmmm [18:32:40] ok well let's go back to the original disk setup then? 
at least that boots far enough to investigate grub [18:32:47] (bad disk in A, old good disk in b) [18:32:48] cmjohnson: ^ [18:33:10] so back to original setup [18:33:31] (03PS5) 10Dzahn: move 'noc' from misc to module [puppet] - 10https://gerrit.wikimedia.org/r/168006 [18:33:43] yeah [18:33:47] btw I love this: [18:33:49] powerdown - power server off [18:33:49] powerup - power server on [18:33:50] (03CR) 10Jgreen: [C: 04-1] Make BounceHandler extension work on en-wiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/168622 (owner: 1001tonythomas) [18:34:15] why not call them poweroff and poweron, if that's how they're described anyways? :p [18:34:22] lots of options for you [18:34:45] I swear every time I first try "racadm serveraction poweroff", then get an error, then check help, then I find out it's "powerdown means power off" [18:34:55] maybe this time I'll remember since I commented about it [18:35:04] (03CR) 10Hoo man: [C: 04-1] Make BounceHandler extension work on en-wiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/168622 (owner: 1001tonythomas) [18:36:26] (03CR) 10Hoo man: Make BounceHandler extension work on en-wiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/168622 (owner: 1001tonythomas) [18:38:23] bblack booting [18:38:50] RECOVERY - Host nickel is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [18:39:32] ok I'm gonna go futz with grub some [18:41:40] cmjohnson: nickel's sda and sdb seem radically different in usage and partition layout, was this really a simple mirror situation? [18:42:04] bblack..it was but I failed the disk and then removed it before powering off to replace [18:42:19] I mean it doesn't seem like it was [18:42:41]