[00:12:32] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[00:52:22] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 348 seconds
[00:52:58] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 383 seconds
[00:53:48] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[00:54:09] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -1 seconds
[01:08:41] (PS1) Plucas: Make the metrics polling interval configurable [puppet/kafka] - https://gerrit.wikimedia.org/r/168528
[01:53:01] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%):
[02:05:09] RECOVERY - Disk space on ocg1001 is OK: DISK OK
[02:07:55] (PS1) Springle: ocg log fills up faster than daily cycle [puppet] - https://gerrit.wikimedia.org/r/168536
[02:20:02] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:23:49] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:24:09] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[02:36:41] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[02:41:30] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[02:46:15] !log LocalisationUpdate completed (1.25wmf4) at 2014-10-24 02:46:15+00:00
[02:46:26] Logged the message, Master
[03:01:20] PROBLEM - puppet last run on mw1083 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:04:10] PROBLEM - puppet last run on mw1031 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:19:49] RECOVERY - puppet last run on mw1083 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[03:22:36] !log LocalisationUpdate completed (1.25wmf5) at 2014-10-24 03:22:36+00:00
[03:22:40] RECOVERY - puppet last run on mw1031 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[03:22:43] Logged the message, Master
[04:20:40] (PS1) Ori.livneh: HHVM: report memory stats to Ganglia [puppet] - https://gerrit.wikimedia.org/r/168538
[04:24:08] (CR) Ori.livneh: [C: 2] HHVM: report memory stats to Ganglia [puppet] - https://gerrit.wikimedia.org/r/168538 (owner: Ori.livneh)
[05:04:34] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Oct 24 05:04:33 UTC 2014 (duration 4m 32s)
[05:04:42] Logged the message, Master
[06:29:19] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:38] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:09] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:39] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:46:18] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:46:19] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:51:01] (PS1) Gage: logstash filter: gelf: hadoop: update for logstash 1.4.2 [puppet] - https://gerrit.wikimedia.org/r/168548
[06:52:25] (CR) Gage: [C: 2] logstash filter: gelf: hadoop: update for logstash 1.4.2 [puppet] - https://gerrit.wikimedia.org/r/168548 (owner: Gage)
[07:26:25] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: puppet fail
[07:36:26] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[07:42:35] (PS1) Alexandros Kosiaris: osm.py: Remove some debug statements [puppet] - https://gerrit.wikimedia.org/r/168555
[07:44:46] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[07:46:00] (PS2) Alexandros Kosiaris: osm.py: Remove some debug statements [puppet] - https://gerrit.wikimedia.org/r/168555
[07:50:06] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[07:56:24] (CR) Alexandros Kosiaris: [C: 2] osm.py: Remove some debug statements [puppet] - https://gerrit.wikimedia.org/r/168555 (owner: Alexandros Kosiaris)
[08:04:59] (CR) Giuseppe Lavagetto: [C: 1] "one nitpick but LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/168006 (owner: Dzahn)
[08:11:58] (PS6) Giuseppe Lavagetto: sudo: create module, remove old files [puppet] - https://gerrit.wikimedia.org/r/167183
[08:18:23] _joe_: you might be able to try that patch on beta cluster since it uses sudo_user to set up rights for mwdeploy
[08:18:53] <_joe_> hashar: yeah but I found one error I made... mh
[08:25:44] (PS7) Giuseppe Lavagetto: sudo: create module, remove old files [puppet] - https://gerrit.wikimedia.org/r/167183
[08:32:26] (CR) Giuseppe Lavagetto: [C: 2] "http://puppet-compiler.wmflabs.org/451/change/167183/html" [puppet] - https://gerrit.wikimedia.org/r/167183 (owner: Giuseppe Lavagetto)
[08:34:18] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[08:36:28] PROBLEM - puppet last run on mw1027 is CRITICAL: CRITICAL: puppet fail
[08:36:28] PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: puppet fail
[08:36:39] PROBLEM - puppet last run on analytics1018 is CRITICAL: CRITICAL: puppet fail
[08:36:49] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: puppet fail
[08:38:02] <_joe_> mmmh
[08:38:50] <_joe_> that's on me I guess
[08:40:32] <_joe_> very strange error indeed
[08:42:49] RECOVERY - puppet last run on mw1027 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[08:54:58] RECOVERY - puppet last run on ms-be1002 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[08:55:18] RECOVERY - puppet last run on analytics1018 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[08:55:29] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[08:56:08] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: puppet fail
[09:12:04] (PS2) Giuseppe Lavagetto: role::labs::instance: include sudo::labs_project [puppet] - https://gerrit.wikimedia.org/r/168062
[09:17:43] (PS2) Nemo bis: Disable l10nupdate for the duration of CLDR 26 plural migration [puppet] - https://gerrit.wikimedia.org/r/168255 (https://bugzilla.wikimedia.org/62861) (owner: Nikerabbit)
[09:22:43] (PS1) Alexandros Kosiaris: Various ganglia::web fixes [puppet] - https://gerrit.wikimedia.org/r/168559
[09:24:49] (CR) Alexandros Kosiaris: [C: 2] Various ganglia::web fixes [puppet] - https://gerrit.wikimedia.org/r/168559 (owner: Alexandros Kosiaris)
[09:27:47] (PS1) Giuseppe Lavagetto: compare-puppet-catalogs: fix hiera copy [software] - https://gerrit.wikimedia.org/r/168560
[09:28:20] (CR) Giuseppe Lavagetto: [C: 2] compare-puppet-catalogs: fix hiera copy [software] - https://gerrit.wikimedia.org/r/168560 (owner: Giuseppe Lavagetto)
[10:34:41] What's links.email.donate.wikimedia.org and why does it allegedly not support https? https://github.com/EFForg/https-everywhere/issues/686
[10:35:33] This server could not prove that it is links.email.donate.wikimedia.org; its security certificate is from *.links.mkt41.net. This may be caused by a misconfiguration or an attacker intercepting your connection.
[10:37:28] * Reedy replies
[10:37:50] Nemo_bis: I guess it's 2 fold. One it's not a WMF site, and 2 it's a multiple subdomain so wouldn't have an SSL cert
[10:49:01] Reedy: he probably got the link from the email donation compaigns
[10:49:49] In which case it's IMHO a WMF bug that links are sent which don't support HTTPS
[10:51:54] indeed
[10:52:02] Well, I guess it partially depeends..
[10:52:10] Does the email give a HTTP link?
[10:52:22] cause I presume HTTPSE is rewriting it
[10:53:31] http link, yes, so I udnerstood
[10:55:34] (PS1) Alexandros Kosiaris: Clean up ganglia::web config [puppet] - https://gerrit.wikimedia.org/r/168570
[10:58:21] (CR) Alexandros Kosiaris: [C: 2] Clean up ganglia::web config [puppet] - https://gerrit.wikimedia.org/r/168570 (owner: Alexandros Kosiaris)
[11:11:13] (PS1) Giuseppe Lavagetto: compare-puppet-catalogs: specifiy hiera_config [software] - https://gerrit.wikimedia.org/r/168571
[11:11:45] (CR) Giuseppe Lavagetto: [C: 2] compare-puppet-catalogs: specifiy hiera_config [software] - https://gerrit.wikimedia.org/r/168571 (owner: Giuseppe Lavagetto)
[11:44:57] wtf job runners
[12:01:25] (PS1) Alexandros Kosiaris: ganglia_aggregators for sca, openldap_corp_mirror [puppet] - https://gerrit.wikimedia.org/r/168577
[12:10:16] !log restarted gmetad on nickel, it was not responding on port 8654
[12:10:23] Logged the message, Master
[12:10:32] (CR) Alexandros Kosiaris: [C: 2] ganglia_aggregators for sca, openldap_corp_mirror [puppet] - https://gerrit.wikimedia.org/r/168577 (owner: Alexandros Kosiaris)
[12:13:07] (PS2) Mforns: Add centralauth to puppet db_config.yaml [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/167821
[12:15:00] (CR) Hashar: [C: 1] contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - https://gerrit.wikimedia.org/r/163791 (owner: Krinkle)
[12:37:45] (PS1) Alexandros Kosiaris: Sync up gmetad clusters [puppet] - https://gerrit.wikimedia.org/r/168578
[12:41:23] (CR) Alexandros Kosiaris: [C: 2] Sync up gmetad clusters [puppet] - https://gerrit.wikimedia.org/r/168578 (owner: Alexandros Kosiaris)
[13:21:44] (PS1) Alexandros Kosiaris: Set state file path for osm ganglia plugin [puppet] - https://gerrit.wikimedia.org/r/168581
[13:23:43] (PS2) Alexandros Kosiaris: make hooft a real 'bastionhost' [puppet] - https://gerrit.wikimedia.org/r/168124 (owner: Dzahn)
[13:23:50] (CR) Alexandros Kosiaris: [C: 2] make hooft a real 'bastionhost' [puppet] - https://gerrit.wikimedia.org/r/168124 (owner: Dzahn)
[13:26:17] (PS2) Alexandros Kosiaris: Set state file path for osm ganglia plugin [puppet] - https://gerrit.wikimedia.org/r/168581
[13:29:06] (CR) Alexandros Kosiaris: [C: 2] Set state file path for osm ganglia plugin [puppet] - https://gerrit.wikimedia.org/r/168581 (owner: Alexandros Kosiaris)
[13:30:48] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Puppet last ran 15456 seconds ago, expected 14400
[13:31:47] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[13:36:07] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.008 second response time
[13:38:06] (PS3) Alexandros Kosiaris: Add a ferm service for ssh on all bastionhosts [puppet] - https://gerrit.wikimedia.org/r/164542
[13:39:15] !log disabled puppet on uranium. Testing ganglia with SSDs
[13:39:23] Logged the message, Master
[13:48:54] http://status.wikimedia.org/ is saying DNS is slow, I'm hearing reports that Commons has been slow for about a week
[13:49:01] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.025 second response time
[13:49:20] (CR) Ottomata: [C: 2 V: 2] Make the metrics polling interval configurable [puppet/kafka] - https://gerrit.wikimedia.org/r/168528 (owner: Plucas)
[13:49:35] Sorry, every site
[13:54:39] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[13:59:59] (CR) Ottomata: Add centralauth to puppet db_config.yaml (1 comment) [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/167821 (owner: Mforns)
[14:01:19] (CR) Alexandros Kosiaris: [C: 2] Add a ferm service for ssh on all bastionhosts [puppet] - https://gerrit.wikimedia.org/r/164542 (owner: Alexandros Kosiaris)
[14:01:54] (PS1) Cmjohnson: Fixing typo on wmnet file [dns] - https://gerrit.wikimedia.org/r/168584
[14:02:22] (CR) Cmjohnson: [C: 2] Fixing typo on wmnet file [dns] - https://gerrit.wikimedia.org/r/168584 (owner: Cmjohnson)
[14:04:33] (PS1) Filippo Giunchedi: handle file missing in projectgid.rb [puppet] - https://gerrit.wikimedia.org/r/168585
[14:05:09] anyone up for an easy one? ^
[14:05:40] <_joe_> I am
[14:06:01] <_joe_> I've bee trying without success come major purge for the last hour or so
[14:06:44] that sounds horrific
[14:06:47] (CR) Giuseppe Lavagetto: [C: 1] "This is true also for production. Thanks for fixing it!" [puppet] - https://gerrit.wikimedia.org/r/168585 (owner: Filippo Giunchedi)
[14:07:40] <_joe_> godog: I'm trying to get rid of the evil parts of webserver.pp
[14:08:18] (PS2) Filippo Giunchedi: handle file missing in projectgid.rb [puppet] - https://gerrit.wikimedia.org/r/168585
[14:08:33] (CR) Filippo Giunchedi: [C: 2] handle file missing in projectgid.rb [puppet] - https://gerrit.wikimedia.org/r/168585 (owner: Filippo Giunchedi)
[14:08:42] (CR) Filippo Giunchedi: [V: 2] handle file missing in projectgid.rb [puppet] - https://gerrit.wikimedia.org/r/168585 (owner: Filippo Giunchedi)
[14:08:44] nice
[14:16:28] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: puppet fail
[14:23:17] (PS1) Alexandros Kosiaris: Minor changes in bastionhost ferm rules [puppet] - https://gerrit.wikimedia.org/r/168589
[14:25:08] (CR) Alexandros Kosiaris: [C: 2] Minor changes in bastionhost ferm rules [puppet] - https://gerrit.wikimedia.org/r/168589 (owner: Alexandros Kosiaris)
[14:26:36] _joe_: once https://gerrit.wikimedia.org/r/#/c/168062/2 is merged, that class will have to be removed from the ldap node definition for every instance.
[14:26:41] I can merge and do that now if you like.
[14:27:23] <_joe_> andrewbogott: I was waiting for you and coren to chime in
[14:27:25] <_joe_> :)
[14:27:31] <_joe_> so yeah if you feel like it
[14:27:57] <_joe_> andrewbogott: https://gerrit.wikimedia.org/r/#/c/168067/ is the companion of that
[14:28:06] (CR) Andrew Bogott: [C: 2] role::labs::instance: include sudo::labs_project [puppet] - https://gerrit.wikimedia.org/r/168062 (owner: Giuseppe Lavagetto)
[14:28:11] Is makes sense to me
[14:28:33] (CR) Andrew Bogott: [C: 1] wikitech: do not include sudoers::labs_project via ldap [mediawiki-config] - https://gerrit.wikimedia.org/r/168067 (owner: Giuseppe Lavagetto)
[14:28:37] It was only separate because it existed before role::labs::instance did.
[14:28:42] _joe_: lemme add that to the swap calendar
[14:28:49] um… swat
[14:29:32] andrewbogott: did you see https://gerrit.wikimedia.org/r/#/c/168269/?
[14:29:34] uhoh, is there now swat on Fridays?
[14:29:50] um… no swat?
[14:30:13] <_joe_> andrewbogott: we usually use sync-file for single-file changes
[14:30:22] andrewbogott: there's usually no swat on friday ya
[14:30:23] (CR) Andrew Bogott: [C: 2] wikitech: do not include sudoers::labs_project via ldap [mediawiki-config] - https://gerrit.wikimedia.org/r/168067 (owner: Giuseppe Lavagetto)
[14:30:29] ok
[14:30:34] (Merged) jenkins-bot: wikitech: do not include sudoers::labs_project via ldap [mediawiki-config] - https://gerrit.wikimedia.org/r/168067 (owner: Giuseppe Lavagetto)
[14:32:01] _joe_: yeah, but that's supposed to happen during the SWAT window generally. I think since this doesn't touch other wikis it's fine to do outside of a window though
[14:33:01] <_joe_> andrewbogott: yeah this is an ops matter I guess
[14:33:15] <_joe_> like when sean pushes changes to dbs :)
[14:35:00] !log andrew Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 02s)
[14:35:11] Logged the message, Master
[14:35:53] <_joe_> am I the only one that finds the fact that webserver::static installs lighty and webserver::php5 installs apache mildly disturbing?
[14:36:06] um… I've definitely done this before, but right now I'm getting a ton of Permission denied (publickey)
[14:36:34] !log andrew Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 03s)
[14:36:39] hm, there we go
[14:36:41] Logged the message, Master
[14:37:12] !log running sync-common on virt1000
[14:37:17] Logged the message, Master
[14:38:43] YuviPanda: I've seen a thousand gerrit emails about that patch so I figure you're still working on it frantically. I haven't read it or thought about it much so far.
[14:39:03] andrewbogott: it's done for a bit now, just bikeshedding now
[14:39:09] ok
[14:40:56] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[14:43:16] Reedy: is ^ my fault? I synced the file that I changed, just a minute ago.
[14:43:28] :P
[14:44:00] it's always Reedy's fault
[14:44:10] Oh, great
[14:47:53] (PS3) Mforns: Add centralauth to puppet db_config.yaml [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/167821
[14:55:03] (CR) Alexandros Kosiaris: [C: 2] Ignore .gitreview when building source [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/167756 (owner: Alexandros Kosiaris)
[14:56:20] _joe_: ok, I removed that class from all ldap records.
[14:56:40] <_joe_> andrewbogott: let's hope this works :)
[14:56:52] seems to. I'll make a new instance to verify
[14:57:20] !log francium going offline, ignore any icinga warning
[14:57:26] Logged the message, Master
[14:57:41] hmmm
[14:57:57] robh: schedule maint via icinga ?
[14:58:10] and as a side note, it would be cool to do it via that bot :-)
[14:59:14] akosiaris: i dont see it in icinga
[14:59:21] the log was more of a 'if i missed it somehow' thing
[14:59:29] its actually not deployed afaict
[15:01:49] (PS1) RobH: reclaiming server francium to spares [puppet] - https://gerrit.wikimedia.org/r/168595
[15:02:54] (PS1) RobH: reclaim francium to spares [dns] - https://gerrit.wikimedia.org/r/168596
[15:03:08] robh: a ok, sorry then :-/
[15:03:25] (PS5) Alexandros Kosiaris: let bastion hosts have base::firewall [puppet] - https://gerrit.wikimedia.org/r/96424 (owner: Dzahn)
[15:04:34] akosiaris: no worries, its a legit request
[15:04:43] i should have been more verbose in my admin log ;]
[15:05:02] cuz alerts when they dont need to happen are a major issue.
[15:05:23] (PS1) Cmjohnson: Adding netboot and dhcpd for elastic1020-1031 [puppet] - https://gerrit.wikimedia.org/r/168597
[15:05:53] (CR) RobH: [C: 2] reclaiming server francium to spares [puppet] - https://gerrit.wikimedia.org/r/168595 (owner: RobH)
[15:06:14] (CR) RobH: [C: 2] reclaim francium to spares [dns] - https://gerrit.wikimedia.org/r/168596 (owner: RobH)
[15:06:17] ottomata: ^^
[15:06:30] (CR) Alexandros Kosiaris: [C: -1] "My minus one is on the basis that this needs some careful coordination as I underlined above to avoid any weird issues and us getting lock" [puppet] - https://gerrit.wikimedia.org/r/96424 (owner: Dzahn)
[15:07:06] cmjohnson: Do you have any spare EX4500s or EX4550s @ eqiad?
[15:07:08] cooooOOl
[15:07:31] (CR) Ottomata: [C: 2] Adding netboot and dhcpd for elastic1020-1031 [puppet] - https://gerrit.wikimedia.org/r/168597 (owner: Cmjohnson)
[15:07:39] will let yo merge cmjohnson
[15:09:48] _joe_: everything looks good. thanks for the cleanup
[15:10:02] <_joe_> andrewbogott: np
[15:24:56] PROBLEM - puppet last run on mw1019 is CRITICAL: CRITICAL: Puppet has 1 failures
[15:26:50] Reedy: I just sent you an email with a plan from Nikerabbit to do some l10n switcheroo (CLDR 26) next week.
[15:33:46] (Draft1) Filippo Giunchedi: import debian/ directory [debs/python-diamond] - https://gerrit.wikimedia.org/r/168599
[15:36:09] (PS1) ArielGlenn: script to clean up salt keys of deleted labs instances [puppet] - https://gerrit.wikimedia.org/r/168601
[15:36:54] (CR) jenkins-bot: [V: -1] script to clean up salt keys of deleted labs instances [puppet] - https://gerrit.wikimedia.org/r/168601 (owner: ArielGlenn)
[15:38:55] RECOVERY - puppet last run on mw1019 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[15:44:26] (PS2) ArielGlenn: script to clean up salt keys of deleted labs instances [puppet] - https://gerrit.wikimedia.org/r/168601
[15:47:06] gwicke: i forget, can you show me your xmldump -> cassandra parser thing again?
[15:47:08] link please?
[15:54:13] (PS1) Giuseppe Lavagetto: webserver: move to a module, fix and remove a few things [puppet] - https://gerrit.wikimedia.org/r/168604
[15:56:45] (CR) Andrew Bogott: [C: 1] "Oh, I was about to say that this is too harsh because it might purge instances that are known to nova, but now I see that you're double-ch" [puppet] - https://gerrit.wikimedia.org/r/168601 (owner: ArielGlenn)
[15:58:31] ottomata: https://github.com/gwicke/restbase-cassandra/tree/master/test/dump
[15:58:50] danke
[15:58:55] js, ah right
[15:58:55] k
[15:59:05] cool
[16:02:38] (CR) Nuria: Add centralauth to puppet db_config.yaml (2 comments) [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/167821 (owner: Mforns)
[16:30:27] PROBLEM - Disk space on virt1006 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%):
[16:34:31] (PS1) Chad: Phabricator: repository.default-local-path to proper location [puppet] - https://gerrit.wikimedia.org/r/168611
[16:37:10] (CR) Rush: [C: 1] "should be good" [puppet] - https://gerrit.wikimedia.org/r/168611 (owner: Chad)
[16:44:23] PROBLEM - puppet last run on mw1019 is CRITICAL: CRITICAL: Puppet has 1 failures
[16:44:44] (PS2) Chad: Phabricator: repository.default-local-path to proper location [puppet] - https://gerrit.wikimedia.org/r/168611
[16:45:12] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail
[16:46:03] RECOVERY - Disk space on virt1006 is OK: DISK OK
[16:50:12] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 137 seconds ago with 0 failures
[16:53:30] _joe_: hey
[16:58:43] RECOVERY - puppet last run on mw1019 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[17:02:13] ori: Hey, can you refresh my mind regarding https://gerrit.wikimedia.org/r/#/c/145997/3 ?
[17:02:19] I'm concerned
[17:02:38] It was a fix for something that caused a fair amount of users to get http bad gateway errors
[17:02:45] but we never merged it?
[17:03:12] it wasn't needed
[17:05:40] <_joe_> ori: hi
[17:07:57] !log getting ready to replace a failed disk on ganglia (server:nickel)...it will be offline for a few minutes
[17:08:05] Logged the message, Master
[17:10:22] <_joe_> ori: I guess most hhvm issues we got reported came from memory exhaustion; I rolling restarted all hhvm appservers on wednesday, we may need to do that again during this weekend or on monday
[17:10:32] RECOVERY - RAID on nickel is OK: OK: Active: 1, Working: 1, Failed: 0, Spare: 0
[17:12:23] PROBLEM - Host nickel is DOWN: PING CRITICAL - Packet loss = 100%
[17:17:33] RECOVERY - Host nickel is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms
[17:19:54] PROBLEM - SSH on nickel is CRITICAL: Connection refused
[17:19:54] PROBLEM - Disk space on nickel is CRITICAL: Connection refused by host
[17:19:54] PROBLEM - check if dhclient is running on nickel is CRITICAL: Connection refused by host
[17:19:54] PROBLEM - puppet last run on nickel is CRITICAL: Connection refused by host
[17:20:05] PROBLEM - check if salt-minion is running on nickel is CRITICAL: Connection refused by host
[17:20:05] PROBLEM - check configured eth on nickel is CRITICAL: Connection refused by host
[17:20:22] PROBLEM - HTTP on nickel is CRITICAL: Connection timed out
[17:20:42] PROBLEM - RAID on nickel is CRITICAL: Timeout while attempting connection
[17:20:47] Anyone working on poor nickel?
[17:20:48] PROBLEM - DPKG on nickel is CRITICAL: Timeout while attempting connection
[17:23:31] Coren: in the backscroll, cmjohnson says he's replacing a disk on nickel.
[17:24:32] PROBLEM - Host nickel is DOWN: CRITICAL - Plugin timed out after 15 seconds
[17:29:07] (CR) Ricordisamoa: [C: -1] "Many minor changes are already in review as I0fad583a66e71e02e9a38b359e60d238167825ef." [mediawiki-config] - https://gerrit.wikimedia.org/r/168322 (https://bugzilla.wikimedia.org/72422) (owner: Glaisher)
[17:32:06] (CR) Glaisher: "Is that a reason to -1 this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/168322 (https://bugzilla.wikimedia.org/72422) (owner: Glaisher)
[17:35:25] (PS1) 01tonythomas: Make BounceHandler extension work on en-wiki [puppet] - https://gerrit.wikimedia.org/r/168622
[17:37:14] RECOVERY - SSH on nickel is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7.1 (protocol 2.0)
[17:37:23] RECOVERY - Host nickel is UP: PING OK - Packet loss = 0%, RTA = 1.65 ms
[17:37:49] (CR) Ricordisamoa: "It is generally not wise to introduce minor changes that duplicate an existing changeset." [mediawiki-config] - https://gerrit.wikimedia.org/r/168322 (https://bugzilla.wikimedia.org/72422) (owner: Glaisher)
[17:39:23] RECOVERY - check configured eth on nickel is OK: NRPE: Unable to read output
[17:39:42] RECOVERY - DPKG on nickel is OK: All packages OK
[17:39:43] RECOVERY - RAID on nickel is OK: OK: Active: 1, Working: 1, Failed: 0, Spare: 0
[17:44:21] (CR) Glaisher: minor changes to InitialiseSettings.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/129464 (owner: Ricordisamoa)
[17:44:41] (PS22) Krinkle: contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - https://gerrit.wikimedia.org/r/163791
[17:46:10] cmjohnson: are you done?
[17:47:06] [ 184.175555] EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 676513
[17:47:09] [ 184.184438] Aborting journal on device dm-0-8.
[17:47:12] [ 184.189128] EXT4-fs error (device dm-0): ext4_journal_start_sb: Detected aborted journal
[17:47:15] [ 184.200031] EXT4-fs (dm-0): Remounting filesystem read-only
[17:47:17] fun
[17:47:22] why did we do this on friday again? :)
[17:48:17] paravoid: after the new disk..it didn't boot and to avoid longer delays i put the old disk back in and now we're getting these ext4-fs errors
[17:48:21] (CR) Ori.livneh: [C: 2] contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - https://gerrit.wikimedia.org/r/163791 (owner: Krinkle)
[17:49:06] i assume the original issue is with grub...although I verified that grub was on /dev/sdb before doing anything
[17:49:10] rebooting it
[17:50:23] I wonder if Alex's rsync finished
[17:50:33] PROBLEM - Host mw1041 is DOWN: PING CRITICAL - Packet loss = 100%
[17:50:54] Hm.. how should I puppetise when there can be multiple instances of something but they depend on something common, where do I define that common resource? E.g. for the testing "localhost" apache we use contint::localvhost resources, and they get specified docroot like /srv/localhost/, we currently repeat the resource for /srv/localhost in three places.
[17:51:00] That conflicts when there is more than one on one node
[17:51:02] PROBLEM - Host nickel is DOWN: PING CRITICAL - Packet loss = 100%
[17:51:33] Where can I define a resource (e.g. a File) that but only once regardless of how many times a resource of that type is declared?
[17:52:22] Krinkle: if there is a unifying commonality, it's better to abstract it out and have each place include it
[17:52:32] RECOVERY - Host nickel is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[17:53:03] Krinkle: in a pinch you can do ensure_resources()
[17:53:22] RECOVERY - puppet last run on nickel is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[17:53:24] RECOVERY - check if salt-minion is running on nickel is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[17:53:36] second disk broken too
[17:53:41] [ 108.072953] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[17:53:44] [ 108.079384] ata1.00: BMDMA stat 0x24
[17:53:46] ori: So in this case we have 1) https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/ci.pp#L384-L396 and 2) https://github.com/wikimedia/operations-puppet/blob/production/modules/contint/manifests/qunit_localhost.pp
[17:53:47] [ 108.082949] ata1.00: failed command: READ DMA
[17:53:50] both using https://github.com/wikimedia/operations-puppet/blob/production/modules/contint/manifests/localvhost.pp
[17:53:57] but mdstat is very weird
[17:54:04] I need both on the same node soon, so I need a place to put the /srv/localhost
[17:54:07] are we sure we didn't boot from the wrong disk?
[17:54:10] Krinkle: "/srv/localhost" is pretty weird
[17:54:32] i switched bios back to boot from port A
[17:54:35] ori: it's like /srv/org/wikimedia/foo we have /srv/localhost/{qunit,mediawiki}
[17:54:43] but for ports instead of subdomains
[17:54:51] cmjohnson: yeah, you've booted from the broken disk
[17:55:10] ori: /var/www seemed reserved and too generic to call dibs on. Happy to put it elsewhere.
[17:55:17] cmjohnson: poweroff and remove that disk
[17:55:33] I put the broken disk back in because I couldn't get nickel to boot with the new disk
[17:55:36] Krinkle: taking a look, sec
[17:55:41] (CR) Dzahn: "oops, broke puppet run on bast1001:" [puppet] - https://gerrit.wikimedia.org/r/167885 (owner: Dzahn)
[17:56:07] cmjohnson: the RAID between the old disks is broken so they act as two independent disks
[17:56:26] cmjohnson: and the broken one has stale data, back from September 2nd
[17:56:32] cmjohnson: so now we've booted with that
[17:56:40] cmjohnson: unplug that disk and reboot
[17:57:05] Krinkle: in this case it seems like hashar was lazy about fixing a group issue
[17:57:22] andrewbogott: would it be possible to add a 'jenkins-deploy' group in labs?
[17:57:49] cmjohnson: hm wait a sec
[17:58:12] ori: you mean in ldap? It's easy enough, let me make sure there isn't one already...
[17:58:16] ok
[17:58:22] andrewbogott: thanks
[17:58:23] !log stat1001 - Duplicate declaration: Package[nodejs]
[17:58:29] Logged the message, Master
[17:58:48] paravoid: i just need to be able to boot from /dev/sdb which I couldn't get to earlier
[17:59:03] ottomata: there's a conflict between statistics.pp and limn module on stat1001
[17:59:12] cmjohnson: yeah, fixed
[17:59:14] mutante: let me take a look for a sec
[17:59:22] cmjohnson: shutdown and unplug that broken disk
[17:59:23] ori: cool,thx
[18:00:17] paravoid: details? what did i miss?
[18:00:28] (PS1) Ori.livneh: misc::statistics: use require_package('nodejs'), as we do elsewhere [puppet] - https://gerrit.wikimedia.org/r/168625
[18:00:33] mutante: ^
[18:00:41] cmjohnson: I did a "grub-install /dev/sdb"
[18:01:30] PROBLEM - Host nickel is DOWN: PING CRITICAL - Packet loss = 100%
[18:01:38] ori: Hm.. what do you mean?
[18:02:13] (CR) Dzahn: [C: 1] "yes, thank you. that should fix the duplicate declaration on stat1001" [puppet] - https://gerrit.wikimedia.org/r/168625 (owner: Ori.livneh)
[18:02:16] ori: Should the user be in puppet?
[18:02:18] ori: there's already a mwdeploy group; this would be the same as that?
[18:02:22] Can you explain what it's for?
[18:02:22] ori: want me to merge ?
[18:02:26] and it depend on that (via a parameter I guess)
[18:02:27] sure
[18:02:45] (that was @mutante)
[18:03:01] (CR) Dzahn: [C: 2] "Duplicate declaration: Package[nodejs]" [puppet] - https://gerrit.wikimedia.org/r/168625 (owner: Ori.livneh)
[18:03:27] andrewbogott: it's so we can avoid having ugly workarounds like /srv/localhost declared separately in role/ci.pp with comment " group => 'root', # no jenkins-deploy group in labs "
[18:05:12] ori: Hm.. but it seems they also differ as jenkins-deploy and jenkins-slave between the two uses
[18:05:21] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[18:05:30] interesting that the owner is fine but not the group?
[18:05:37] (CR) Dzahn: "yep. fixed. RECOVERY - puppet last run on stat1001 is OK" [puppet] - https://gerrit.wikimedia.org/r/168625 (owner: Ori.livneh)
[18:05:47] ugh
[18:05:57] Krinkle: might want to take it up with hashar
[18:06:29] (PS4) Mforns: Add centralauth to puppet db_config.yaml [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/167821
[18:06:33] ori: let's say for now the same users exist and that's not an issue. What about the resource in general
[18:06:47] Krinkle: the resource would stay in the module and move out of the role
[18:07:17] I don't think the solution should have the user/group hardcoded in the definition, it should be fine to pass different values in prod and labs.
[18:07:44] as long as I can use it in multiple labs roles that are applies to the same role/node
[18:07:52] so move it out of the module and have a separate ::foo::production / ::foo::labs roles
[18:07:55] or use hiera
[18:07:59] but i gotta run, sorry
[18:08:02] noam's up
[18:08:33] ori, Krinkle, I'm happy to create groups -- make me a bug once you figure out what you need.
[18:08:53] I don't know about the groups, it's a strange setup hashar made. Don't create anything yet.
[18:09:08] I can work around that for now, my problem is with something else.
[18:09:13] I'll get back if I need anything, thanks.
[18:09:24] (PS1) BBlack: allow zero-length const_string_add() [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/168626
[18:09:48] (CR) BBlack: [C: 2 V: 2] allow zero-length const_string_add() [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/168626 (owner: BBlack)
[18:10:00] paravoid: i get a grub rescue prompt
[18:10:12] grmbl
[18:10:29] (CR) Dzahn: move 'noc' from misc to module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/168006 (owner: Dzahn)
[18:15:20] cmjohnson: can I grab the console and poke around a bit?
[18:15:26] sure
[18:15:33] all yours
[18:16:00] bblack ..i couldn't find a /boot
[18:16:16] ok
[18:16:37] I just want to catch up to wherever paravoid was at and confirm. then we'll probably end up booting the bad disk again and try to re-fix the good one.
[18:16:47] (PS1) Kaldari: Adding WikiGrok to extensions list for testing on Beta Labs, etc. [mediawiki-config] - https://gerrit.wikimedia.org/r/168628 (https://bugzilla.wikimedia.org/72465)
[18:17:33] (CR) Kaldari: [C: 2] Adding WikiGrok to extensions list for testing on Beta Labs, etc. [mediawiki-config] - https://gerrit.wikimedia.org/r/168628 (https://bugzilla.wikimedia.org/72465) (owner: Kaldari)
[18:17:40] (Merged) jenkins-bot: Adding WikiGrok to extensions list for testing on Beta Labs, etc. [mediawiki-config] - https://gerrit.wikimedia.org/r/168628 (https://bugzilla.wikimedia.org/72465) (owner: Kaldari)
[18:19:16] (PS1) Krinkle: contint: Minor clean up [puppet] - https://gerrit.wikimedia.org/r/168629
[18:19:18] (PS1) Krinkle: contint: Move /srv/localhost/qunit resource out of qunit_localhost class [puppet] - https://gerrit.wikimedia.org/r/168630
[18:19:20] (PS1) Krinkle: [WIP] contint: Apply contint::qunit_localhost to labs slaves [puppet] - https://gerrit.wikimedia.org/r/168631
[18:20:00] (CR) jenkins-bot: [V: -1] contint: Move /srv/localhost/qunit resource out of qunit_localhost class [puppet] - https://gerrit.wikimedia.org/r/168630 (owner: Krinkle)
[18:20:22] (CR) jenkins-bot: [V: -1] [WIP] contint: Apply contint::qunit_localhost to labs slaves [puppet] - https://gerrit.wikimedia.org/r/168631 (owner: Krinkle)
[18:20:27] cmjohnson: is the bad disk still out, or back in?
[18:20:38] it's still out
[18:20:59] (PS2) Krinkle: contint: Move /srv/localhost/qunit resource out of qunit_localhost class [puppet] - https://gerrit.wikimedia.org/r/168630
[18:21:05] (PS2) Krinkle: [WIP] contint: Apply contint::qunit_localhost to labs slaves [puppet] - https://gerrit.wikimedia.org/r/168631
[18:21:24] cmjohnson: bios shows 2x 500GB disks on sata ports A + B
[18:21:45] correct...there is a new disk in port A
[18:21:50] ah!
[18:21:53] (PS1) MaxSem: Revert "Adding WikiGrok to extensions list for testing on Beta Labs, etc."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/168633 [18:22:02] (03CR) 10MaxSem: [C: 032] Revert "Adding WikiGrok to extensions list for testing on Beta Labs, etc." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168633 (owner: 10MaxSem) [18:22:12] (03Merged) 10jenkins-bot: Revert "Adding WikiGrok to extensions list for testing on Beta Labs, etc." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168633 (owner: 10MaxSem) [18:22:46] can we try booting with just the old unfailed disk in port B, and leaving A unplugged for now? [18:23:02] sure [18:23:07] ...give me a sec [18:23:09] (is this stuff hotpluggable at the hw level btw?) [18:23:20] nope...internal [18:23:43] can you shutdown [18:23:48] oh yeah [18:24:06] (03CR) 10Nuria: [C: 031] "Tested on vagrant. Please Andrew check that these changes are likely to work in staging/prod." [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/167821 (owner: 10Mforns) [18:24:09] well I disconnected anyways, it's back on a grub rescue prompt [18:25:53] bblack...booting [18:28:00] (03PS4) 10Dzahn: move 'noc' from misc to module [puppet] - 10https://gerrit.wikimedia.org/r/168006 [18:29:22] (03PS2) 1001tonythomas: Make BounceHandler extension work on en-wiki [puppet] - 10https://gerrit.wikimedia.org/r/168622 [18:30:03] (03CR) 10jenkins-bot: [V: 04-1] Make BounceHandler extension work on en-wiki [puppet] - 10https://gerrit.wikimedia.org/r/168622 (owner: 1001tonythomas) [18:30:09] cmjohnson: I'm getting nothing on serial cons, is anything happening there? [18:30:39] oh there we go, grubrescue again [18:30:43] had to hit f1 [18:31:05] hmmmm [18:32:40] ok well let's go back to the original disk setup then? 
at least that boots far enough to investigate grub [18:32:47] (bad disk in A, old good disk in b) [18:32:48] cmjohnson: ^ [18:33:10] so back to original setup [18:33:31] (03PS5) 10Dzahn: move 'noc' from misc to module [puppet] - 10https://gerrit.wikimedia.org/r/168006 [18:33:43] yeah [18:33:47] btw I love this: [18:33:49] powerdown - power server off [18:33:49] powerup - power server onpowerdown - power server off [18:33:50] (03CR) 10Jgreen: [C: 04-1] Make BounceHandler extension work on en-wiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/168622 (owner: 1001tonythomas) [18:34:15] why not call them poweroff and poweron, if that's how they're described anyways? :p [18:34:22] lots of options for you [18:34:45] I swear every time I first try "racadm serveraction poweroff", then get an error, then check help, then I find out it's "powerdown means power off" [18:34:55] maybe this time I'll remember since I commented about it [18:35:04] (03CR) 10Hoo man: [C: 04-1] Make BounceHandler extension work on en-wiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/168622 (owner: 1001tonythomas) [18:36:26] (03CR) 10Hoo man: Make BounceHandler extension work on en-wiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/168622 (owner: 1001tonythomas) [18:38:23] bblack booting [18:38:50] RECOVERY - Host nickel is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [18:39:32] ok I'm gonna go futz with grub some [18:41:40] cmjohnson: nickel's sda and sdb seem radically different in usage and parition layout, was this really a simple mirror situation? 
[18:42:04] bblack..it was but I failed the disk and then removed it before powering off to replace [18:42:19] I mean it doesn't seem like it was [18:42:41] it seems like, perhaps once upon a time it was, but sometime later someone killed the sdb half of the root-mirror stuff, and put 3 partitions on sdb to use for unmirrored data [18:42:53] or something like that [18:43:01] /dev/sda1 1 54722 439545856 fd Linux raid autodetect [18:43:04] /dev/sda2 59829 60802 7811072 82 Linux swap / Solaris [18:43:07] ^ looks "normal" [18:43:12] /dev/sdb1 1 1216 9764864 fd Linux raid autodetect [18:43:15] /dev/sdb2 1216 1338 976896 fd Linux raid autodetect [18:43:18] /dev/sdb3 1338 60802 477642752 fd Linux raid autodetect [18:43:21] ^ something else entirely? [18:44:35] wasn't sdb replaced? by a disk that already had partitions i guess? [18:44:39] oh this is using LVM [18:45:25] sdb is the supposedly-good disk, sda is the one that's failing and we tried to replace [18:45:31] but currently everything's back where it was, I think? [18:45:38] ah yes, sorry i misread [18:46:35] it is back to the original disks [18:46:47] in the original slots as well? [18:46:57] yes [18:47:00] because it kinda seems like sdb is the one giving errors currently [18:47:25] sdb has not moved or changed in any way afaik [18:47:50] paravoid did a "grub-install /dev/sdb" [18:47:59] which does the bootsector [18:48:11] but it's not going to help if /dev/sdb doesn't contain a boot/root parition copy as well [18:48:19] I think he must have been assuming a standard mirror setup [18:49:54] before turning off I ran mdadm --manage /dev/md0 --fail /dev/sda1 mdadm --manage /dev/md0 --remove /dev/sda1 [18:50:56] even the install_server stuff in puppet says nickel was originally intended to have normal raid1 layout [18:52:07] you know what..hindsight...when I did cat /proc/mdstat...i only recall only seeing md0 .... 
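(editor's aside) The exchange above turns on what `cat /proc/mdstat` showed before and after the disk swap. As a minimal sketch, assuming the usual /proc/mdstat text layout, the check for degraded arrays and failed members could be scripted as below. The sample text and the names `SAMPLE_MDSTAT` and `parse_mdstat` are hypothetical illustrations and are not output captured from nickel.

```python
import re

# Hypothetical sample in the usual /proc/mdstat layout; NOT taken from nickel.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0](F)
      439545792 blocks [2/1] [_U]

md1 : active raid1 sdb2[1] sda2[0]
      7811008 blocks [2/2] [UU]

unused devices: <none>
"""

# "md0 : active raid1 sdb1[1] sda1[0](F)" -> name, state, rest of the line
HEADER = re.compile(r"^(md\w+)\s*:\s*(\S+)\s+(.*)$")
# "[2/1] [_U]" -> expected members, active members, per-slot status map
STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def parse_mdstat(text):
    """Return {array: {"state", "members", "failed", "degraded"}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            # Member tokens look like "sda1[0]" or "sda1[0](F)"; the RAID
            # level token ("raid1") has no bracket and is skipped.
            tokens = [t for t in header.group(3).split() if "[" in t]
            arrays[current] = {
                "state": header.group(2),
                "members": [t.split("[")[0] for t in tokens],
                "failed": [t.split("[")[0] for t in tokens if "(F)" in t],
                "degraded": False,
            }
            continue
        status = STATUS.search(line)
        if status and current is not None:
            expected = int(status.group(1))
            active = int(status.group(2))
            # Degraded when fewer members are active than expected,
            # or when any slot in the status map shows "_".
            arrays[current]["degraded"] = (
                active < expected or "_" in status.group(3)
            )
    return arrays
```

On a live host one would pass in the contents of /proc/mdstat instead of the sample, for example `parse_mdstat(open("/proc/mdstat").read())`. The sketch only reads state; it does not replace `mdadm --detail` for deciding which disk to pull.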
[18:52:10] no md1 [18:52:15] I can't even make logical sense of what I see here right now in /proc/mdstat and that an LVM volume is mounted as root [18:52:29] there's currently 4x entries in /proc/mdstat [18:52:50] md0, and md_d0, md_d1, md_d2 [18:53:03] that's way different than what I had [18:53:19] (with md0 being on sda1 with a failed other-half, and the others being the 3 partitions on sdb) [18:53:24] just md0 and it showed sda1 as failed and sdb as normal [18:54:05] is there an RT or something for the original disk failure? [18:54:17] rt8252 [18:54:44] (03PS2) 10Cmjohnson: Adding netboot and dhcpd for elastic1020-1031 [puppet] - 10https://gerrit.wikimedia.org/r/168597 [18:55:32] !log repooled mw1189 to do heap profiling on production api workload. [18:55:36] the only thing that would make of this make sense to me right now is if what's currently "/dev/sdb" is in fact not the original /dev/sdb [18:55:41] Logged the message, Master [18:56:36] don't know how that is possible...didn't move it....there is the bios change to boot port B right now...doubt that would make the difference [18:56:56] sda does seem to be the original sda, because its data stopped updating Sept 2nd, which lines up with the ticket [18:57:01] (03CR) 10Cmjohnson: [C: 032] Adding netboot and dhcpd for elastic1020-1031 [puppet] - 10https://gerrit.wikimedia.org/r/168597 (owner: 10Cmjohnson) [18:57:09] but one would expect sdb to look like sda [18:58:08] bblack: IIRC ganglia was not writing to disk but to tmpfs, and the init script was hacked to sync the tmpfs to disk on service stop/start, so the sept 2 mtime may simply be the last time that happened [18:58:41] well even lastlog for logins stopped at sep2 [18:58:47] oh, ok [18:59:05] (and syslogs and such) [18:59:34] and there's a disk error on sdb when booting up currently, which also doesn't make sense [18:59:40] (03PS3) 1001tonythomas: Make BounceHandler extension work on en-wiki [puppet] - 10https://gerrit.wikimedia.org/r/168622 [19:00:43] 
did someone reconfigure the storage on this machine since the sda breakage on sept2? [19:01:23] I don't think so [19:01:31] well not that I know of anyway [19:02:05] it's not in SAL either [19:03:50] (03PS5) 10Mforns: Add centralauth to puppet db_config.yaml [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/167821 [19:08:20] bblack: graphite has disk metrics from nickel under the server.nickel.* namespace, might be useful for sleuthing. i see entries for sd* under server.nickel.iostat [19:09:11] (03CR) 10Ottomata: Add centralauth to puppet db_config.yaml (031 comment) [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/167821 (owner: 10Mforns) [19:09:33] well the only sane thing to assume now is that sda is very messed up in some way [19:09:59] and that sdb's 3x partitions represent the normal layout, what it was running on up until today (10G root, swap, /srv (unused)) [19:11:22] yes, the stats agree with that [19:12:11] (03CR) 10Dzahn: [C: 032] "the only diff is the intended renaming of the role "noc-wikimedia" to just "noc", we want to avoid "-" in role names" [puppet] - 10https://gerrit.wikimedia.org/r/168006 (owner: 10Dzahn) [19:13:35] (03CR) 1001tonythomas: [C: 031] "Tested the configuration live in our labs instance ( verpremotemx ) and got the following in the exim4.conf" [puppet] - 10https://gerrit.wikimedia.org/r/168622 (owner: 1001tonythomas) [19:14:22] (03CR) 10Dzahn: "noc is now module , noop on terbium , except update-motd.d/05-role-role--noc]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/168006 (owner: 10Dzahn) [19:14:55] (03CR) 10Dzahn: "the "role-role" stuff is another story, it's everywhere :p meh" [puppet] - 10https://gerrit.wikimedia.org/r/168006 (owner: 10Dzahn) [19:15:41] bblack: what do you think the best course of action is ? 
[19:16:04] I'm not sure yet, I'm still investigating [19:16:24] my default at the moment would be replace both disks and start over from scratch and restore the data, assuming we have data to restore [19:16:30] but I'm still looking for something less dire than that [19:17:10] I can mount /dev/sdb1 (aka currently /dev/md_d0), and it does look sort of like a root filesystem [19:17:22] but one that is in some initial state from back in April or so [19:17:37] almost like it was removed the array during its own initial installation and never quite finished setting up [19:17:59] there's nothing in /boot, there's no /var/log/syslog or rotations of it, no fs dates past April, except a few from some failed boot attempt today. [19:18:50] yeah I donno what it is really [19:18:54] it looks more like an installer disk [19:19:37] I wonder if april was when this machine was installed? [19:19:46] (in which case raid's been broken ever since) [19:20:51] stashed a temporary copy of rrd_rootdir "/mnt/ganglia_tmp/rrds.pmtpa" [19:20:51] on lithium:/srv/nickel_ganglia just in case [19:20:53] fillipo [19:20:59] fillipo did that ^ [19:21:18] and alex was doing something today [19:21:21] with a box called uranium [19:21:23] but I have no clue what [19:21:27] I can call him, it's not too late [19:22:33] called, no answer [19:23:24] well so sda1 is what it is: the state of the machine frozen on sept 2 [19:23:32] i wonder if i should revert this or just let it be for now .. https://gerrit.wikimedia.org/r/#/c/167885/3/manifests/nfs.pp [19:23:36] yeah that's what I saw too [19:23:38] what about sdb? [19:23:52] and the data in sdb1 (which has a different partition layout!) looks like an initial rootfs from an OS installer, it has almost nothing and dates back to April [19:24:02] whaa..?! 
[19:24:04] those are the only two viable candidate rootfs-like things I see [19:24:23] paravoid: look on nickle (currently booted on sda), at /mnt/sdb1 (which is readonly) [19:24:45] it doesn't have normal contents for /lib, /var/log, etc. it really looks like an installer partition of some kind that never got very far [19:24:57] nothing in /boot either [19:25:37] (and then there's a few dates from today as well, beats me, maybe some aborted boot attempt earlier) [19:25:40] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Puppet has 1 failures [19:26:15] and where are the rrds? [19:26:28] the backups? [19:26:32] no [19:26:43] yeah the rrds are the least of the problem with sdb1 [19:26:47] hmm.. the DHCP server broke i think [19:26:50] yeah I saw [19:26:56] I mean, if sda is not it, and sdb is not it [19:27:00] where the fuck are they :) [19:27:12] the Cloud [19:27:13] were they supposed to be in the /srv partition? [19:27:26] no, /var/lib [19:27:49] there's a strange deal with ganglia, there's a tmpfs [19:28:01] and a script that copies it there on boot and on shutdown (and I think with a cronjob as well) [19:28:09] I really think something went catastrophically wrong back in April, probably during the installation of this machine, and was never noticed [19:28:09] because the disks couldn't sustain the rrd i/o traffic [19:28:12] and we never really had working raid [19:28:28] that's the best I can guess from what I see anyways [19:28:36] what's the contents of sdb3? [19:28:48] well there's no fstab for mounting /srv anyways, but it should be /srv [19:28:51] lemme look [19:28:56] it's 489G [19:29:13] empty filesystem [19:29:28] config error in linux-host-entries broke dhcp server .. looking [19:29:54] all of /dev/sdb looks like it was laid out for initial install, and the installer dropped a few essential pre-install essential bits into the rootfs, and then died, in April. 
[19:30:19] yes seems so [19:30:23] this looks like a debootstrap attempt [19:30:26] and then /dev/sda has a different partition sizing/layout that actually looks functional, but stopped getting updates on Sept 2 [19:30:37] so how is the box running the past 2 months? [19:30:38] so where the hell was the data going the last 6 weeks or so? [19:30:45] well the data was going to tmpfs [19:30:52] /mnt/ganglia_tmp [19:30:56] yeah but what about syslog and such? [19:30:58] because that's how the box is set up [19:31:08] surely it couldn't have kept running with no available disks to write to, right? [19:31:22] it /could/, but wouldn't we get an alert from e.g. puppet [19:31:28] puppet failures I mean [19:31:52] cmjohnson: are you 100% sure these are the original disks? [19:31:59] 100% [19:32:05] well assuming there aren't other parts of this story we don't know, the other possibility is that we've been running on /dev/sda since, but in readonly mode [19:32:07] (03CR) 10Dzahn: "Oct 24 19:28:18 carbon dhcpd: /etc/dhcp3/linux-host-entries.ttyS1-115200 line 1450: expecting a network hardware type" [puppet] - 10https://gerrit.wikimedia.org/r/168597 (owner: 10Cmjohnson) [19:32:13] and ganglia was ok with a readonly rootfs because it had the tmpdir to write to [19:32:19] yes that's what I meant "it could" [19:32:26] but at least puppet would fail [19:32:31] and we'd presumably get puppet failures [19:32:41] [21:02:20] ACKNOWLEDGEMENT - RAID on nickel is CRITICAL: CRITICAL: Active: 1, Working: 1, Failed: 1, Spare: 0 daniel_zahn RT: 8252 Disk fail () [19:32:53] Coren is the one that has opened the ticket [19:32:57] maybe he remembers? [19:32:57] (03PS1) 10Dzahn: fix typo in DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/168648 [19:33:13] so, reinstall, use fillipo's backups is what I think happens now regardless. [19:33:13] * Coren reads scrollback. [19:33:40] bblack: alex was setting up uranium as a trusty/ganglia replacement [19:33:41] why would puppet fail? 
[19:33:46] I think he's succeded to some degree [19:34:04] surely puppet runs would write to the rootfs in the course of our normal ops/puppet config changes over the past month or two [19:34:05] Coren: when you logged in to nickel back in September 2nd and noticed the failed disk, what else did you notice [19:34:15] (03PS2) 10Dzahn: fix typo in DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/168648 [19:34:28] if nothing else they're write out their state data in /var? [19:34:43] (03CR) 10Dzahn: [C: 032] fix typo in DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/168648 (owner: 10Dzahn) [19:35:30] paravoid: is uranium actually ready? [19:35:41] I have no clue :) [19:36:03] I suspect it must give better http output than: There was an error collecting ganglia data (127.0.0.1:8654): fsockopen error: Connection refused [19:36:11] maybe I should flip over to it and see if it looks ok :) [19:36:25] paravoid: I don't remember anything specific; failed disk in a mirror, I marked it faield in the array and opened the ticket. AFAIK, after the reboot, things when back to normal (but with a degraded array) [19:36:36] which reboot? [19:36:52] Didn't I reboot the box then? I thought I did. [19:37:00] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [19:37:00] no idea [19:37:00] PROBLEM - puppet last run on ms-fe1002 is CRITICAL: CRITICAL: Puppet has 1 failures [19:37:10] lastlog will tell. [19:37:16] (03CR) 10Dzahn: "isc-dhcp-server start/running, process 3618" [puppet] - 10https://gerrit.wikimedia.org/r/168648 (owner: 10Dzahn) [19:37:28] if only if we had a root fs to run that on :) [19:37:41] IIRC I had to reboot it because the filesystem had failed to readonly. [19:37:55] but there was an array you mdadm'ed? [19:38:09] Yeah, marked the second disk as failed. [19:38:29] man this is crazy [19:38:45] * Coren tries to catch up with the backlog to understand what's up. 
[19:38:50] (03CR) 10Dzahn: "fixed in I01154ffe8d55d29 . DHCP server is up. you should now be able to install" [puppet] - 10https://gerrit.wikimedia.org/r/168597 (owner: 10Cmjohnson) [19:39:06] mutante th [19:39:07] thx [19:39:16] I need to context-switch a bit to fix Jeff_Green's thing which is a bit urgent [19:39:16] yw [19:40:42] * Coren fails to understand wth happened with the partitions. [19:41:43] well uranium seems to mostly work and have very-recent data in it, going back a month or so [19:41:52] As far as I recall, this was a straightforward raid1 deal. When the disk gave errors, I failed the disk in the array, rebooted to restore the filesystem, and all was well (except for the degraded array) [19:42:08] how about I flip DNS over for that, and then we can look at restoring the old data into it from fillipo's backup? [19:42:41] (well I'm doing the DNS thing now regardless, because it's better than nothing for now) [19:42:59] bblack: Agreed; some older data is better than no monitoring. [19:43:21] (03PS1) 10BBlack: ganglia -> uranium, because we planned it like this [dns] - 10https://gerrit.wikimedia.org/r/168650 [19:43:22] newer [19:43:31] we get all the latest data there, we just don't have long-term history [19:44:03] (03CR) 10BBlack: [C: 032] ganglia -> uranium, because we planned it like this [dns] - 10https://gerrit.wikimedia.org/r/168650 (owner: 10BBlack) [19:44:52] (03PS6) 10Mforns: Add centralauth to puppet db_config.yaml [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/167821 [19:45:54] the UI looks a little different, and things seem to load slower, and long-term history is gone. but otherwise it seems to work [19:46:12] oh not all the long-term is gone, I see some year-long data for some things now [19:46:34] (03CR) 10Dzahn: [C: 031] "adding base::firewall is cool! i went to analytics1023 and yea, i see java listening on 2181 and 2183 though not on 2182 right now. 
i see " [puppet] - 10https://gerrit.wikimedia.org/r/168185 (owner: 10Ottomata) [19:46:52] e.g. http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&h=cp1057.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=cpu_report&c=Bits+caches+eqiad [19:47:08] (well assuming you've wiped your local DNS cache or hacked /etc/hosts to make that use uranium) [19:47:16] ottomata: arrr, i did not intend to merge it, ! arr, i just added +1 [19:47:31] ottomata: but it happened because it was already +2 from alex :p .. what now [19:47:42] eh? [19:47:49] that change adding the ferm rules [19:47:49] uh oh [19:47:51] for zookeper [19:47:52] oh [19:47:55] it is ok [19:47:57] mutante: [19:48:01] base firewall isn't actually applied yet [19:48:09] pheew, ok :) yea, what you just said [19:48:19] i will run puppet on one and super doulbe check [19:48:20] so, i see java opening some other ports [19:48:27] but i havent checked what they are [19:48:32] thanks! [19:48:43] yes, i'm not totally sure eitiher [19:48:44] (03CR) 10Greg Grossmeier: "Planned for Tuesday Oct 28th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158121 (owner: 10Jforrester) [19:48:46] merging on puppetmaster [19:48:50] ok [19:50:05] ja, think things are fine there [19:50:20] good:) [19:50:29] (03PS3) 10Greg Grossmeier: Switch from SpecialCite to CiteThisPage on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158121 (https://bugzilla.wikimedia.org/71112) (owner: 10Jforrester) [19:50:58] (03PS7) 10Mforns: Add centralauth to puppet db_config.yaml [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/167821 [19:51:53] (03PS1) 10BBlack: default gmetad host -> uranium for monitor_ganglia [puppet] - 10https://gerrit.wikimedia.org/r/168655 [19:52:28] (03CR) 10BBlack: [C: 032 V: 032] default gmetad host -> uranium for monitor_ganglia [puppet] - 10https://gerrit.wikimedia.org/r/168655 (owner: 10BBlack) [19:52:41] let's leave things like that for the weekend at least [19:53:02] (03PS1) 10Ori.livneh: 
hhvm::monitoring: typo fix for ganglia module [puppet] - 10https://gerrit.wikimedia.org/r/168658 [19:53:09] bblack: +1 [19:53:22] (03PS8) 10Ottomata: Add centralauth to puppet db_config.yaml [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/167821 (owner: 10Mforns) [19:53:25] !log nickel's basically dead, uranium has been promoted to prod ganglia a little early for now [19:53:28] (03CR) 10Ori.livneh: [C: 032 V: 032] hhvm::monitoring: typo fix for ganglia module [puppet] - 10https://gerrit.wikimedia.org/r/168658 (owner: 10Ori.livneh) [19:53:30] Logged the message, Master [19:53:33] (03CR) 10Ottomata: [C: 032 V: 032] Add centralauth to puppet db_config.yaml [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/167821 (owner: 10Mforns) [19:54:01] bblack: i merged your patch since it was queued [19:54:08] it told me I merged yours! [19:54:56] ---->*<---- (that's the whole git repo colliding with its anti-repo and becoming energy) [19:55:15] you can be my wingman anytime, bblack [19:55:17] RECOVERY - puppet last run on ms-fe1002 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [19:55:55] actually i think you may have won by a hair [19:56:02] since i see 'Already up-to-date.' 
right after i purportedly merged it [19:57:58] ottomata: 2101 would be this : -Dcom.sun.management.jmxremote.port=2101 [19:58:19] (03PS3) 10Rush: Phabricator: repository.default-local-path to proper location [puppet] - 10https://gerrit.wikimedia.org/r/168611 (owner: 10Chad) [19:58:25] (03CR) 10Rush: [C: 032 V: 032] Phabricator: repository.default-local-path to proper location [puppet] - 10https://gerrit.wikimedia.org/r/168611 (owner: 10Chad) [19:58:56] hmm [19:59:39] according to my documentation, mutante, that should be https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Ports#JMX [19:59:43] 9998 [20:00:03] ottomata: http://stackoverflow.com/questions/20884353/why-java-opens-3-ports-when-jmx-is-configured [20:00:04] AH [20:00:11] that is the jmx port of the jmxtrans process [20:00:12] that's fine [20:00:14] Why Java opens 3 ports when JMX is configured? [20:00:16] that can be blocked. [20:00:16] !log reedy Synchronized php-1.25wmf4/extensions/SemanticForms/: noop for prod (duration: 00m 16s) [20:00:21] Answer: It is Java's bug [20:00:21] :p [20:00:24] Logged the message, Master [20:00:26] Ugh [20:00:27] That was painful [20:00:44] ottomata: cool [20:00:44] ssh: connect to host mw1041 port 22: No route to host [20:01:13] !log mw1088 has a full / [20:01:20] Logged the message, Master [20:01:55] !log mw1041 is down [20:01:59] Logged the message, Master [20:02:26] Soemone fancy power cycling mw1041? 
[20:02:28] (03CR) 10Dzahn: "so 2101 is this:" [puppet] - 10https://gerrit.wikimedia.org/r/168185 (owner: 10Ottomata) [20:03:03] Reedy: ok [20:03:04] !log reedy Synchronized php-1.25wmf5/extensions/SemanticForms/: noop for prod (duration: 00m 17s) [20:03:10] Logged the message, Master [20:04:32] !log powercycled mw1041 [20:04:40] Logged the message, Master [20:05:06] mw1087 [20:05:09] /dev/sda1 222G 37G 175G 18% / [20:05:14] mw1088 [20:05:16] /dev/sda1 222G 211G 0 100% / [20:06:11] Reedy: [ 16.996722] bnx2 0000:01:00.0: eth0: NIC Copper Link is Down [20:06:27] that doesn't sound good :) [20:06:34] well, like unplugged :p [20:06:45] RECOVERY - Host mw1041 is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [20:06:49] eh.. lol? [20:07:27] 163G /usr/local/apache/core [20:07:54] TimStarling: About? [20:08:00] You enabled core dumps on mw1088, / is now full [20:08:11] Reedy: mw1041 is back nevertheless [20:08:23] mutante: thanks [20:08:25] yw [20:09:07] !log running sync-common on mw1041 [20:09:14] Logged the message, Master [20:09:18] :) [20:09:25] ori: about? Know anything about mw1088? [20:09:49] capturing segfault coredumps for https://bugzilla.wikimedia.org/show_bug.cgi?id=71519 [20:09:51] but / is full [20:11:17] pff [20:11:35] I am pretty sure Ifilled a bug about hhvm filling disks due to coredumps [20:11:42] heh [20:11:50] gzip other_vhosts_access.log.1 [20:11:52] i'm running it [20:12:01] Tim seemingly explicitly enabled it [20:12:02] 00:34 Tim: core dumps were enabled on mw1088, unexpectedly started gathering natural segfault traffic [20:12:31] mutante: Reckon we could just gzip some of the older core dumps? [20:13:40] No idea if they all serve some use [20:13:45] maybe let this finish first? where are they [20:13:52] i mean the log is also huge [20:13:59] and it's .1 , not current [20:14:03] reedy@mw1088:/usr/local/apache/core$ du --si . [20:14:03] 175G . 
[20:14:26] :o [20:15:03] !log / full on mw1088 due to apache core dumps [20:15:11] Logged the message, Master [20:15:21] <_joe_> apache core dumps? [20:15:29] O.O [20:15:37] !log mw1088 - gzip other_vhosts_access.log.1 - Avail. 38G [20:15:41] arg. 3.8! [20:15:44] Logged the message, Master [20:15:47] Reedy: 3.8 free [20:15:53] heh [20:16:17] <_joe_> if you don't disable core dumps for apache, it's gonna fill up again [20:16:35] <_joe_> I actually remember tim debugging some issue last week [20:16:42] <_joe_> now I'm really off sorry [20:16:51] <_joe_> mutante: do disable core dumps [20:20:45] PROBLEM - puppet last run on amssq57 is CRITICAL: CRITICAL: puppet fail [20:20:55] PROBLEM - NTP on mw1041 is CRITICAL: NTP CRITICAL: Offset unknown [20:22:57] !log mw1088 - gzipping core dump files, disabled core dumps, restarted apache [20:23:04] Logged the message, Master [20:23:05] _joe_: Reedy , yep, done [20:23:47] Reedy: still zipping more old ones ... find .. [20:24:49] (03PS1) 10Kaldari: Adding WikiGrok to extensions list for testing on Beta Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168688 (https://bugzilla.wikimedia.org/72465) [20:26:04] RECOVERY - NTP on mw1041 is OK: NTP OK: Offset -0.0006600618362 secs [20:26:14] (03PS1) 10RobH: allocate two servers for ipsec testing [dns] - 10https://gerrit.wikimedia.org/r/168689 [20:27:39] (03CR) 10RobH: [C: 032] allocate two servers for ipsec testing [dns] - 10https://gerrit.wikimedia.org/r/168689 (owner: 10RobH) [20:28:57] !log sync-common on mw1088 [20:29:02] (03CR) 10MaxSem: [C: 032] Adding WikiGrok to extensions list for testing on Beta Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168688 (https://bugzilla.wikimedia.org/72465) (owner: 10Kaldari) [20:29:04] Logged the message, Master [20:29:09] (03Merged) 10jenkins-bot: Adding WikiGrok to extensions list for testing on Beta Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168688 (https://bugzilla.wikimedia.org/72465) (owner: 10Kaldari) 
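(editor's aside) The mw1088 cleanup above came down to finding which files under one directory were eating the root filesystem. A rough sketch of that search, assuming plain Python on the host, might look like the following. The function names `largest_files` and `total_size` are made up for illustration, and the sizes are apparent file sizes rather than the allocated blocks that `du` reports.

```python
import os


def largest_files(root, limit=5):
    """Return (size_in_bytes, path) pairs for the biggest files under root.

    Pass limit=None to get every file, largest first.
    """
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # skip symlinks so targets are not counted twice
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or is unreadable; ignore it
    sizes.sort(reverse=True)
    return sizes[:limit]


def total_size(root):
    """Sum the apparent sizes of all regular files under root."""
    return sum(size for size, _path in largest_files(root, limit=None))
```

For instance, `largest_files("/usr/local/apache/core", limit=10)` would list the ten biggest files in the core dump directory mentioned in the log; whether any of them are safe to compress or delete is still a human call.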
[20:32:48] _joe_, you have an unmerged change in mediawiki-config [20:33:27] MaxSem: what's the change? We patched wikitech earlier but I'm pretty sure I merged it... [20:33:45] andrewbogott, https://gerrit.wikimedia.org/r/#/c/168067/ [20:33:54] not merged on tin [20:33:55] RECOVERY - Disk space on mw1088 is OK: DISK OK [20:34:13] oh, duh, I merged but didn't rebase first [20:34:18] I will do that now! [20:36:01] !log andrew Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 04s) [20:36:06] MaxSem: better? [20:36:07] Logged the message, Master [20:36:24] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [20:36:35] woo [20:36:48] (03PS1) 10RobH: reclaiming al-fundraising.wikimedia.org in public1-a-eqiad subnet ip addresses [dns] - 10https://gerrit.wikimedia.org/r/168691 [20:40:10] RECOVERY - puppet last run on amssq57 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [20:43:54] PROBLEM - exim incoming message rate on iodine is CRITICAL: exim_messages_in CRITICAL: 0.0 [20:44:52] (03CR) 10RobH: [C: 032] reclaiming al-fundraising.wikimedia.org in public1-a-eqiad subnet ip addresses [dns] - 10https://gerrit.wikimedia.org/r/168691 (owner: 10RobH) [20:46:27] (03CR) 10Cscott: [C: 04-1] "This isn't the fix you want, I don't think." [puppet] - 10https://gerrit.wikimedia.org/r/168536 (owner: 10Springle) [20:51:28] (03CR) 10Cscott: "Minor nit to pick regarding the name of the new configuration file." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/166610 (owner: 10Subramanya Sastry) [20:52:37] (03CR) 10Dzahn: "yay :) thank you Alex, best kind of -1 there is" [puppet] - 10https://gerrit.wikimedia.org/r/96424 (owner: 10Dzahn) [20:52:49] !log revived virt1006 on a probationary basis. It's running compute but is disabled so new instances won't be scheduled there. I've moved a few test instances there to see how it behaves. 
[20:52:55] Logged the message, Master
[20:55:52] (CR) Hashar: "Note that to work on the beta cluster an extension has to be registered in mediawiki/extensions.git using the sync-with-gerrit.py script a" [mediawiki-config] - https://gerrit.wikimedia.org/r/168688 (https://bugzilla.wikimedia.org/72465) (owner: Kaldari)
[20:57:47] andrewbogott, yup! :)
[21:04:26] (PS5) Dzahn: Tampa decom - clean up 152.80.208.in-addr.arpa [dns] - https://gerrit.wikimedia.org/r/167868
[21:06:03] i guess i don't need public IPs, i'm simulating varnish to varnish inter-colo traffic
[21:06:06] oop
[21:06:11] *tab*
[21:07:37] (PS6) Dzahn: Tampa decom - clean up 152.80.208.in-addr.arpa [dns] - https://gerrit.wikimedia.org/r/167868
[21:09:51] (CR) Dzahn: [C: 2] Tampa decom - clean up 152.80.208.in-addr.arpa [dns] - https://gerrit.wikimedia.org/r/167868 (owner: Dzahn)
[21:10:58] (CR) Greg Grossmeier: "And for the record, this change broke updates to the Beta Cluster, causing other teams pain. Kaldari: Where can we put something that othe" [mediawiki-config] - https://gerrit.wikimedia.org/r/168688 (https://bugzilla.wikimedia.org/72465) (owner: Kaldari)
[21:13:59] deployers, is this harmless ?
https://gerrit.wikimedia.org/r/#/c/167888/1/manifests/role/deployment.pp
[21:14:04] RECOVERY - exim incoming message rate on iodine is OK: exim_messages_in OKAY: 3.0
[21:14:26] mutante: should be
[21:19:54] ori: thanks
[21:31:01] (PS1) Dzahn: rancid - convert to module [puppet] - https://gerrit.wikimedia.org/r/168698
[21:32:15] (PS1) Nemo bis: Allow all custom Meta-Wiki namespaces in Special:Book [mediawiki-config] - https://gerrit.wikimedia.org/r/168699 (https://bugzilla.wikimedia.org/72493)
[21:32:30] (PS2) Dzahn: rancid - convert to module [puppet] - https://gerrit.wikimedia.org/r/168698
[21:33:54] PROBLEM - Certificate expiration on virt1000 is CRITICAL: SSL_CERT CRITICAL virt1000.wikimedia.org: certificate will expire on Jan 22 21:31:13 2015 GMT
[21:37:34] surely that should be a warning?
[21:37:42] (PS1) John F. Lewis: Add $wmgAddWikiNotifyEmail for use by notifyNewProjects [mediawiki-config] - https://gerrit.wikimedia.org/r/168701
[21:37:44] (CR) Dzahn: "http://puppet-compiler.wmflabs.org/458/change/168698/html/netmon1001.wikimedia.org.html" [puppet] - https://gerrit.wikimedia.org/r/168698 (owner: Dzahn)
[21:37:57] (PS1) John F. Lewis: Remove hardcoding from notifyNewProjects [puppet] - https://gerrit.wikimedia.org/r/168702
[21:40:01] (PS2) John F. Lewis: Remove hardcoding from notifyNewProjects [puppet] - https://gerrit.wikimedia.org/r/168702 (https://bugzilla.wikimedia.org/48786)
[21:50:19] cmjohnson: hey! sorry to ping you while you are working. how are servers going? I'm excited to have them!
[21:50:49] manybubbles...they're done and ready for install..i thought ottomata was going to do the installs...i can do them now if you like
[21:51:19] manybubbles: and it's not even christmas!
[21:51:39] cmjohnson: they need the right partitioning. I'm not sure how to do that.
also, maybe we shouldn't hurry to add them on a friday afternoon :)
[21:51:46] maybe I poke him on monday
[21:52:01] manybubbles you want the same partitioning as the others
[21:52:32] cmjohnson: yeah - with the mirror raid for os partition taking up some disk and the rest of striping for es
[21:53:15] okay..then they're ready for install...there is an elasticsearch cfg for partman
[21:54:23] <^d> Today's better than Arbor Day :D
[21:54:41] don't go mess'n with Arbor Day...
[21:55:37] ok. well. I'm not sure where to go from here. I'd bug ottomata about it but he isn't about. So Monday probably
[21:56:01] (PS1) RobH: setting public ip for berkelium & curium servers [dns] - https://gerrit.wikimedia.org/r/168703
[21:57:14] manybubbles it's correct. Installing now...he +1 my cfg this morning
[21:57:26] he went into management and cool
[21:57:33] sorry - irc fail
[21:57:38] cool. thanks for installing!
[21:58:07] can you not get them into puppet though. I don't want to have them starting up elasticsearch at 6pm on friday afternoon. it'd probably be ok
[21:58:31] but I don't want to get called at my kid's birthday party tomorrow morning because surprise!
[21:58:45] <^d> Now I want cake.
[21:58:48] i will do basic install and otto can do puppet Monday
[21:58:53] <^d> I have leftover cheesecake!
[21:59:17] (CR) Cscott: [C: 1] "Looks fine. Schedule this to be SWATted?" [mediawiki-config] - https://gerrit.wikimedia.org/r/168699 (https://bugzilla.wikimedia.org/72493) (owner: Nemo bis)
[21:59:53] (PS1) RobH: setting server berkelium/curium install params [puppet] - https://gerrit.wikimedia.org/r/168705
[22:00:06] (CR) Nemo bis: "If you want.
Not that urgent, can wait for Reedy's deployments" [mediawiki-config] - https://gerrit.wikimedia.org/r/168699 (https://bugzilla.wikimedia.org/72493) (owner: Nemo bis)
[22:00:47] (CR) RobH: [C: 2] setting public ip for berkelium & curium servers [dns] - https://gerrit.wikimedia.org/r/168703 (owner: RobH)
[22:01:26] (PS2) RobH: setting server berkelium/curium install params [puppet] - https://gerrit.wikimedia.org/r/168705
[22:02:15] (CR) RobH: [C: 2] setting server berkelium/curium install params [puppet] - https://gerrit.wikimedia.org/r/168705 (owner: RobH)
[22:05:34] but thanks!
[22:30:06] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: puppet fail
[22:38:09] (PS1) Dzahn: contacts/outreach - move to module [puppet] - https://gerrit.wikimedia.org/r/168713
[22:40:01] !log puppet disabled on uranium, do not enable
[22:40:07] Logged the message, Master
[22:40:39] PROBLEM - puppet last run on nickel is CRITICAL: CRITICAL: Puppet last ran 14449 seconds ago, expected 14400
[22:41:21] Reedy: mw1088: Avail 130G
[22:49:36] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[23:04:09] PROBLEM - Host berkelium is DOWN: CRITICAL - Plugin timed out after 15 seconds
[23:04:48] ^^ that's me, just rebooting after dist-upgrade
[23:05:28] RECOVERY - Host berkelium is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[23:07:52] (PS1) Dzahn: management/ipmi - move to module [puppet] - https://gerrit.wikimedia.org/r/168719
[23:10:14] (PS3) Dzahn: rancid - move to module [puppet] - https://gerrit.wikimedia.org/r/168698
[23:12:39] PROBLEM - Varnishkafka log producer on amssq42 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[23:18:47] (PS1) Dzahn: fundraising logging - move out of misc [puppet] - https://gerrit.wikimedia.org/r/168723
[23:30:30] (PS1) Calak: Create "Abuse filter editor" user group on fawiki [mediawiki-config] -
https://gerrit.wikimedia.org/r/168725 (https://bugzilla.wikimedia.org/72502)
[23:35:08] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: Puppet last ran 29026 seconds ago, expected 28800
[23:40:09] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: Puppet last ran 29326 seconds ago, expected 28800
[23:45:09] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: Puppet last ran 29625 seconds ago, expected 28800
[23:47:29] (PS1) Dzahn: Tampa cleanup - remove pmtpa monitoring groups [puppet] - https://gerrit.wikimedia.org/r/168726
[23:50:20] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: Puppet last ran 29926 seconds ago, expected 28800
[23:50:56] (CR) Dzahn: [C: 2] Tampa cleanup - remove pmtpa monitoring groups [puppet] - https://gerrit.wikimedia.org/r/168726 (owner: Dzahn)
[23:55:10] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: Puppet last ran 30225 seconds ago, expected 28800
[23:56:40] (PS1) Dzahn: kafka - remove/replace pmtpa [puppet] - https://gerrit.wikimedia.org/r/168727
[23:57:41] (PS2) Dzahn: kafka - remove/replace pmtpa [puppet] - https://gerrit.wikimedia.org/r/168727