[00:00:05] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151216T0000). Please do the needful.
[00:00:05] csteipp legoktm yurik: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:20] o/
[00:00:21] o/
[00:03:44] Sorry for being late
[00:04:19] lol @ csteipp's patch
[00:04:20] csteipp: Ahm, what does 'passwordCannotBePopular' => INT_MAX do?
[00:04:36] RoanKattouw: Checks all passwords in the CDB
[00:04:57] CDB?
[00:05:00] We grab the rank of the password, and compare it to the policy. So INT_MAX is bigger than any index into the cdb.
[00:05:17] OK
[00:05:27] So there's not something trying to iterate with INT_MAX as an upper bound
[00:05:41] No :)
[00:05:49] (03PS3) 10Catrope: Set initial Staff password policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258387 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp)
[00:05:57] (03CR) 10Catrope: [C: 032] Set initial Staff password policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258387 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp)
[00:06:10] because opening the cdb on every request to count the number of entries would just be lolworthy
[00:06:23] (03CR) 10Catrope: [C: 032] Fix merging of $wgExtractsRemoveClasses post-extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259425 (https://phabricator.wikimedia.org/T121592) (owner: 10Legoktm)
[00:07:01] (03Merged) 10jenkins-bot: Set initial Staff password policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258387 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp)
[00:07:23] (03Merged) 10jenkins-bot: Fix merging of $wgExtractsRemoveClasses post-extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259425 (https://phabricator.wikimedia.org/T121592) (owner: 10Legoktm)
[00:12:00] (03PS1) 10Yuvipanda: tools: No more puppet client classs [puppet] - 10https://gerrit.wikimedia.org/r/259429
[00:12:05] bd808: ^
[00:12:15] (03PS2) 10Yuvipanda: tools: No more puppet client classs [puppet] - 10https://gerrit.wikimedia.org/r/259429
[00:12:25] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: No more puppet client classs [puppet] - 10https://gerrit.wikimedia.org/r/259429 (owner: 10Yuvipanda)
[00:12:40] ShiveringPanda: bam! I was just going to look for that
[00:12:52] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Password policy for staff, take 3; fix $wgExtractsRemoveClasses (duration: 00m 31s)
[00:12:57] bd808: yeah, when I removed it the es patch wasn't merged and when I merged it I didn't look for it
[00:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:13:05] bd808: try logging in now?
[00:13:18] * RoanKattouw curses the office wifi
[00:13:20] yurik: You around for your SWAT patch?
[00:13:23] legoktm, csteipp: Yours are done ---^^
[00:13:25] RoanKattouw, yep
[00:13:30] RoanKattouw: thanks, warnings have stopped
[00:14:00] Thanks!
[00:14:21] finally got that working right...
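The `'passwordCannotBePopular' => INT_MAX` check discussed above can be sketched roughly as follows. This is a hedged illustration, not MediaWiki's actual code: `POPULAR_PASSWORDS` is a stand-in for the popularity-ranked CDB file, and `password_is_popular` is a hypothetical name for the policy check csteipp describes (look up the password's rank, compare it to the policy value; nothing iterates up to INT_MAX).

```python
import sys

# Stand-in for the CDB of popular passwords, keyed by password with the
# value being its rank (index in the popularity-sorted list).
POPULAR_PASSWORDS = {"123456": 1, "password": 2, "qwerty": 3}

def password_is_popular(password: str, policy_cutoff: int) -> bool:
    rank = POPULAR_PASSWORDS.get(password)
    if rank is None:
        return False  # not in the list at all, so the policy passes
    # INT_MAX as the cutoff is bigger than any index into the cdb,
    # so every listed password is rejected regardless of rank.
    return rank <= policy_cutoff

print(password_is_popular("password", sys.maxsize))                   # True
print(password_is_popular("correct horse battery staple", sys.maxsize))  # False
```

With a smaller cutoff the same check would only reject, say, the top 100 passwords, which is why a single integer doubles as both "off for most" and "check everything" via INT_MAX.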
[00:14:22] (03PS2) 10Ori.livneh: $wmfUdp2logDest: replace IPs with hostnames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251647
[00:14:35] (03PS3) 10Ori.livneh: $wmfUdp2logDest: replace IPs with hostnames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251647
[00:15:18] (03CR) 10jenkins-bot: [V: 04-1] $wmfUdp2logDest: replace IPs with hostnames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251647 (owner: 10Ori.livneh)
[00:21:19] 6operations, 10Wikimedia-Mailing-lists: Create a new mailing list: elf@lists.wikimedia.org - https://phabricator.wikimedia.org/T120523#1883463 (10Dzahn)
[00:23:38] !log catrope@tin Synchronized php-1.27.0-wmf.9/extensions/Graph/: SWAT: update Vega (duration: 00m 30s)
[00:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:23:50] 6operations, 10Wikimedia-Mailing-lists: Create a new mailing list: elf@lists.wikimedia.org - https://phabricator.wikimedia.org/T120523#1883464 (10Dzahn) added operations - until recently these were handled by volunteer(s) but for now ops is needed (unfortunately) because only ops has the list creator access no...
[00:26:30] yurik: Yours is done ---^^
[00:26:41] RoanKattouw, already tested, works, thx )
[00:28:50] RoanKattouw, could you do one more small config change?
[00:28:59] Sure
[00:29:47] RoanKattouw, basically i want to switch wmgGraphEnableGZip to true
[00:29:57] it seems to be working in labs
[00:30:51] i will submit a patch in a sec
[00:30:53] What does that do again?
[00:30:59] Changes the way things are stored, right?
[00:31:15] RoanKattouw, it stores graphs as gzip in pageprops
[00:31:18] instead of clear text
[00:31:32] this fixes a bug with many graphs not fitting into 64k
[00:32:47] (03PS1) 10Yurik: Enable wmgGraphEnableGZip for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259433
[00:32:52] RoanKattouw, ^
[00:33:07] lets do it in two steps - first mediawiki, quick test, and then all
[00:33:15] just to be on the safer side
[00:33:25] should be done within 15 min
[00:34:06] And this is backwards compatible with things that are already stored in the clear earlier?
[00:34:12] RoanKattouw, yep
[00:34:15] Cool
[00:34:20] (03CR) 10Catrope: [C: 032] Enable wmgGraphEnableGZip for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259433 (owner: 10Yurik)
[00:34:34] RoanKattouw, its even forward compatible - if we disable it )
[00:34:42] (03Merged) 10jenkins-bot: Enable wmgGraphEnableGZip for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259433 (owner: 10Yurik)
[00:34:47] Oh nice
[00:36:06] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable wmgGraphEnableGzip on mediawikiwiki (duration: 00m 30s)
[00:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:36:22] (03PS1) 10Yurik: Enable wmgGraphEnableGZip for all graphs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259434
[00:36:37] RoanKattouw, this ^ is for when we know the first one works
[00:36:58] yurik: OK, the first one is up
[00:38:12] RoanKattouw, all good
[00:38:40] (03CR) 10Catrope: [C: 032] Enable wmgGraphEnableGZip for all graphs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259434 (owner: 10Yurik)
[00:39:32] (03Merged) 10jenkins-bot: Enable wmgGraphEnableGZip for all graphs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259434 (owner: 10Yurik)
[00:39:55] bblack: https://www.mediawiki.org/wiki/Manual:$wgForeignFileRepos#Details this warning is not quite correct anymore, right?
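The compatibility property yurik describes (new rows compressed, old clear-text rows still readable, and vice versa if the flag is turned off) can be sketched like this. This is an assumption-laden illustration using Python's zlib, not the Graph extension's actual storage code; `store_graph` and `load_graph` are hypothetical names:

```python
import zlib

def store_graph(spec: str) -> bytes:
    # Compress the graph spec so large graphs fit the 64k pageprops limit.
    return zlib.compress(spec.encode("utf-8"))

def load_graph(blob: bytes) -> str:
    # Backwards compatible: rows written before compression was enabled
    # are not valid zlib streams, so decompression fails and we fall
    # back to treating the blob as clear text.
    try:
        return zlib.decompress(blob).decode("utf-8")
    except zlib.error:
        return blob.decode("utf-8")

spec = '{"width": 400, "height": 200}'
assert load_graph(store_graph(spec)) == spec     # new, compressed rows
assert load_graph(spec.encode("utf-8")) == spec  # old, clear-text rows
```

The try/except sniffing is what makes the switch safe to flip in either direction without migrating existing pageprops rows.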
[00:40:31] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable wmgGraphEnableGzip on all wikis (duration: 00m 30s)
[00:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:41:27] 6operations, 10Wikimedia-Mailing-lists: Create a new mailing list: elf@lists.wikimedia.org - https://phabricator.wikimedia.org/T120523#1883493 (10Dzahn) I have created the list. I set the email addresses listed above as initial admins, set the list description to "Education and Legislative Forum", and then re...
[00:43:09] MatmaRex: If people are manually configuring they might want to still know that https is better than http. We never improved the underlying redirect behavior in MW there.
[00:43:54] ostriches: yeah, but the part about "it is still planned to switch to HTTPS in the future" looks pretty outdated to me. ;)
[00:44:03] (03PS1) 10CSteipp: Set password policy for global sysadmin group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259436 (https://phabricator.wikimedia.org/T104370)
[00:44:34] MatmaRex: Rewording is needed then :)
[00:45:41] * MatmaRex has already written six kilobytes of documentation today
[00:45:41] RoanKattouw, thanks for your help!
[00:47:45] 6operations, 10Wikimedia-Mailing-lists: Create a new mailing list: elf@lists.wikimedia.org - https://phabricator.wikimedia.org/T120523#1883507 (10Dzahn) I saw you said "private" so i set the archiving option to "private" and "subscribe_policy" to "require approval" assuming that's what you want. Please see ht...
[00:48:04] 6operations, 10Wikimedia-Mailing-lists: Create a new mailing list: elf@lists.wikimedia.org - https://phabricator.wikimedia.org/T120523#1883508 (10Dzahn) 5Open>3Resolved a:3Dzahn
[00:49:27] (03PS1) 10CSteipp: Set password policy for global steward group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259439 (https://phabricator.wikimedia.org/T104371)
[00:51:12] 6operations, 10Wikimedia-Mailing-lists: Add @dpatrick to ops mailing list - https://phabricator.wikimedia.org/T121441#1879011 (10Dzahn)
[00:52:20] 6operations, 10Wikimedia-Mailing-lists: Add @dpatrick to ops mailing list - https://phabricator.wikimedia.org/T121441#1883520 (10Dzahn) 5Open>3Resolved Successfully subscribed: dpatrick@wikimedia.org
[00:56:06] 6operations, 5Patch-For-Review: Add openldap/labs servers to backup - https://phabricator.wikimedia.org/T120919#1883526 (10Dzahn) a:5Dzahn>3None
[00:56:44] 6operations, 5Patch-For-Review: Add openldap/labs servers to backup - https://phabricator.wikimedia.org/T120919#1864777 (10Dzahn) per the comments on that gerrit link above, Alex said " I 'll try and create a bpipe approach in labs as a first step."
[00:57:51] (03CR) 10Dzahn: "@Alex thank you, i added a comment on thet ticket and gave it back to pool" [puppet] - 10https://gerrit.wikimedia.org/r/259174 (https://phabricator.wikimedia.org/T120919) (owner: 10Dzahn)
[00:59:06] 6operations: Add openldap/labs servers to backup - https://phabricator.wikimedia.org/T120919#1864777 (10Dzahn)
[01:01:07] RoanKattouw, are you sure it synced?
[01:01:29] Yup
[01:02:04] RoanKattouw, any easy way to check if for example ruwiki has that value?
i think we had some interactive shell script somewhere
[01:02:52] 6operations, 10Deployment-Systems: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585#1883535 (10Dzahn) after taking a look at tin and mira (and finding more inconsistencies and "system" users with UIDs over 10000) i suggest we do UID 120 for l10nupdate
[01:03:10] Yup I'll check
[01:03:47] catrope@tin:/srv/mediawiki-staging$ mwscript eval.php ruwiki
[01:03:49] > var_dump($wgGraphEnableGZip);
[01:03:50] hmm, maybe that var is only used in v9, that's why its not showing up in v8.. in which case i will simply wait for the train to finish
[01:03:50] bool(true)
[01:04:00] Oh, hah, right
[01:04:02] yeah
[01:04:12] all's good, thanks!
[01:04:13] ))
[01:04:27] oh yeah, the eval.php, thanks for reminding
[01:05:31] (03PS1) 10Dzahn: scap: change l10nupdate UID from 10002 to 120 [puppet] - 10https://gerrit.wikimedia.org/r/259441 (https://phabricator.wikimedia.org/T120585)
[01:06:46] (03PS2) 10Dzahn: scap: change l10nupdate UID from 10002 to 120 [puppet] - 10https://gerrit.wikimedia.org/r/259441 (https://phabricator.wikimedia.org/T120585)
[01:06:58] (03PS3) 10Dzahn: scap: change l10nupdate UID from 10002 to 120 [puppet] - 10https://gerrit.wikimedia.org/r/259441 (https://phabricator.wikimedia.org/T120585)
[01:09:23] (03PS4) 10Dzahn: scap: change l10nupdate UID from 10002 to 120 [puppet] - 10https://gerrit.wikimedia.org/r/259441 (https://phabricator.wikimedia.org/T120585)
[01:11:17] is there any kind of history in grafana? I edited a graph and then saved it and...the new graph completely replaced the rest of the dashboard somehow
[01:11:44] oh nm, i'm just on the wrong query string...
[01:11:52] (03PS5) 10Dzahn: scap: change l10nupdate UID from 10002 to 120 [puppet] - 10https://gerrit.wikimedia.org/r/259441 (https://phabricator.wikimedia.org/T120585)
[01:13:04] 6operations, 10Wikimedia-Mailing-lists: Create a new mailing list: elf@lists.wikimedia.org - https://phabricator.wikimedia.org/T120523#1883547 (10SlimVirgin) Dzahn, thank you for doing this.
[01:16:12] (03CR) 10Yuvipanda: [C: 031] clean-pam-config: move backupfiles to a different dir [puppet] - 10https://gerrit.wikimedia.org/r/259296 (https://phabricator.wikimedia.org/T121533) (owner: 10Andrew Bogott)
[01:17:14] 6operations: system users with UIDs > 500 - https://phabricator.wikimedia.org/T121610#1883565 (10Dzahn) 3NEW
[01:17:48] 6operations, 10Deployment-Systems, 5Patch-For-Review: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585#1856865 (10Dzahn)
[01:17:50] 6operations: system users with UIDs > 500 - https://phabricator.wikimedia.org/T121610#1883574 (10Dzahn)
[01:18:47] (03PS3) 10Andrew Bogott: clean-pam-config: move backupfiles to a different dir [puppet] - 10https://gerrit.wikimedia.org/r/259296 (https://phabricator.wikimedia.org/T121533)
[01:18:49] (03PS2) 10Mattflaschen: Add computed dblist for Flow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250460
[01:20:11] 6operations: system users with UIDs > 500 - https://phabricator.wikimedia.org/T121610#1883576 (10Dzahn) https://docs.puppetlabs.com/references/latest/type.html#user-attribute-system system Whether the user is a system user, according to the OS's criteria; on most platforms, a UID less than or equal to 500 ind...
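The puppet docs quoted in that task come down to a simple convention: on most platforms a UID at or below 500 marks a system user, which is why l10nupdate at UID 10002 stood out. A quick, hypothetical audit of a host against that convention could look like this (illustration only, not a script anyone in the channel actually ran; `SYSTEM_UID_MAX` and `misfiled_system_users` are made-up names):

```python
import pwd

SYSTEM_UID_MAX = 500  # conventional cutoff cited in the puppet docs

def misfiled_system_users(expected_system=("l10nupdate",)):
    # Accounts that are supposed to be system users by role, but whose
    # UID falls outside the conventional system range.
    return [u.pw_name for u in pwd.getpwall()
            if u.pw_name in expected_system and u.pw_uid > SYSTEM_UID_MAX]

# root (UID 0) is always inside the system range on a sane host:
assert any(u.pw_uid == 0 for u in pwd.getpwall())
print(misfiled_system_users())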
[01:20:32] (03CR) 10Krinkle: [C: 031] $wmfUdp2logDest: replace IPs with hostnames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251647 (owner: 10Ori.livneh)
[01:21:10] 6operations: system users with UIDs > 500 - https://phabricator.wikimedia.org/T121610#1883577 (10Dzahn) p:5Triage>3Normal
[01:23:55] (03CR) 10Andrew Bogott: [C: 032] clean-pam-config: move backupfiles to a different dir [puppet] - 10https://gerrit.wikimedia.org/r/259296 (https://phabricator.wikimedia.org/T121533) (owner: 10Andrew Bogott)
[01:25:10] 6operations, 10Wikimedia-Mailing-lists: Add @dpatrick to ops mailing list - https://phabricator.wikimedia.org/T121441#1883584 (10dpatrick) Thanks. I'm on.
[01:32:29] 6operations, 7Mail: Mails from MediaWiki seem to get (partially) lost - https://phabricator.wikimedia.org/T121105#1883596 (10Dzahn) @Lydia_Pintscher hmm, any similarity between the ones that don't work vs. the ones that work? Like for example these are all in the Wikidata: namespace but maybe others are not? D...
[01:34:07] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1883597 (10Papaul) @Robh I am getting this error message during install. {F3107778}
[01:50:52] (03PS4) 10EBernhardson: Cron job to rebuild completion indices [puppet] - 10https://gerrit.wikimedia.org/r/258068 (https://phabricator.wikimedia.org/T112028)
[01:52:16] (03PS5) 10EBernhardson: Cron job to rebuild completion indices [puppet] - 10https://gerrit.wikimedia.org/r/258068 (https://phabricator.wikimedia.org/T112028)
[01:52:39] (03PS6) 10EBernhardson: Cron job to rebuild completion indices [puppet] - 10https://gerrit.wikimedia.org/r/258068 (https://phabricator.wikimedia.org/T112028)
[01:55:15] (03PS1) 10EBernhardson: [elasticsearch] Collect cluster health stats about shard movement [puppet] - 10https://gerrit.wikimedia.org/r/259443 (https://phabricator.wikimedia.org/T117284)
[02:05:25] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:13:37] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1883663 (10RobH) Just continue past it, we don't need swap space on it. Thanks for checking!
!log installing OS on kafka200[1-2]
[02:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:16] PROBLEM - HHVM rendering on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:29:25] PROBLEM - Apache HTTP on mw1135 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.003 second response time
[02:30:08] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.8) (duration: 13m 37s)
[02:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:30:37] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[02:31:25] RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 65207 bytes in 7.764 second response time
[02:31:26] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.306 second response time
[02:42:51] !log installation complete on kafka200[1-2]
[02:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:54:02] !log kafka200[1-2] signing puppet certs, salt key initial run
[02:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:02:46] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 16m 00s)
[03:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:10:45] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors
[03:13:14] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1883705 (10Papaul)
[03:14:32] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1881702 (10Papaul) signing puppet certs, salt key complete
[03:27:28] Anyone know what this means: Dec 13 11:10:57 elastic1012 kernel: [35310150.410504] TCP: TCP: Possible SYN flooding on port 9300. Sending cookies. Check SNMP counters.
[03:27:44] i see it exactly once in syslog on elastic1012, and it came in just a few seconds before the node dropped from the cluster
[03:29:39] actually, that came in 17 seconds after the server dropped from the cluster. That port is the inter-node transport.
[03:34:16] would a reasonable guess be the server was still holding onto the port, but not reading from it, and the queue just maxed out?
[03:35:30] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct
[03:51:47] (03PS1) 10Andrew Bogott: Revert "Revert "Reorder modules in common-account"" [puppet] - 10https://gerrit.wikimedia.org/r/259445
[03:52:02] mutante: ^
[04:00:26] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1883739 (10Papaul) a:5Papaul>3aaron Hey Aaron the installation is complete on kafka200[1-2] If you have any questions please let me know. Thanks
[04:00:55] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1883741 (10Ottomata) CooOoL ok thanks!
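ebernhardson's guess above, a process that keeps the port open but stops servicing it so the kernel's accept queue fills and new SYNs trigger syncookies, can be shown in miniature. This is a sketch of the mechanism only, not of Elasticsearch's transport layer: the kernel completes TCP handshakes on the server's behalf and queues them in the listen() backlog, so clients "connect" fine right up until that queue is full.

```python
import socket

# A listener that never calls accept(): the kernel still completes the
# three-way handshake for clients and parks them in the listen() backlog.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)  # deliberately tiny accept queue
port = srv.getsockname()[1]

client = socket.create_connection(("127.0.0.1", port), timeout=2)
# The connect above succeeded even though the server never accept()ed;
# the connection is just sitting in the kernel queue. Once that queue
# fills, the kernel answers further SYNs with syncookies, which is
# exactly the "Possible SYN flooding ... Sending cookies" syslog line.
print("connected without accept()")
client.close()
srv.close()
```

So the log line is consistent with a hung-but-alive process: the socket is still bound, nothing is draining the queue, and the kernel falls back to cookies.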
[04:01:19] (03CR) 10Dzahn: [C: 031] Revert "Revert "Reorder modules in common-account"" [puppet] - 10https://gerrit.wikimedia.org/r/259445 (owner: 10Andrew Bogott) [04:03:00] (03PS2) 10Andrew Bogott: Revert "Revert "Reorder modules in common-account"" [puppet] - 10https://gerrit.wikimedia.org/r/259445 [04:04:43] (03CR) 10Andrew Bogott: [C: 032] Revert "Revert "Reorder modules in common-account"" [puppet] - 10https://gerrit.wikimedia.org/r/259445 (owner: 10Andrew Bogott) [04:42:51] PROBLEM - puppet last run on mw1101 is CRITICAL: CRITICAL: Puppet has 1 failures [04:44:12] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [100000000.0] [04:48:12] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0] [08:28:45] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:28:45] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:33:27] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [08:33:39] (03PS1) 10Yurik: Disable 'Graph:' ns for meta and mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259448 [08:33:39] RECOVERY - puppet last run on mw1101 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [08:33:40] !log Rebooting labstore1001 [08:33:41] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:33:52] !log Rebooting labstore1001 [08:33:53] PROBLEM - puppet last run on mw2037 is CRITICAL: CRITICAL: puppet fail [08:33:54] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 307 [08:33:54] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 328 [08:33:54] PROBLEM - MariaDB Slave Lag: s3 on 
db2050 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 348 [08:33:54] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 376 [08:33:56] Looks like morebots went missing. [08:33:56] Leah: yeah, labs NFS is dead and with it all of tools [08:33:57] :-( [08:33:57] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Seconds_Behind_Master: 46 [08:33:59] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [08:34:00] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 205, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [08:34:07] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 321 [08:34:08] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [08:34:08] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [08:34:09] Wikimedia Platform operations, serious stuff | Status: Tools outage | Log: http://bit.ly/wikisal | Channel logs: http://ur1.ca/edq22 | Ops Clinic Duty: ottomata [08:34:09] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [08:34:09] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: puppet fail [08:34:09] RECOVERY - puppet last run on mw2037 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [08:34:16] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 [08:34:16] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 207, down: 0, dormant: 0, 
excluded: 0, unused: 0 [08:34:18] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [08:34:20] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 2 failures [08:34:20] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [08:34:20] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures [08:34:20] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 failures [08:34:20] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures [08:34:20] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures [08:34:20] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures [08:34:20] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [08:34:20] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures [08:34:20] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures [08:34:20] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures [08:34:23] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [08:34:24] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit nfs-exports is inactive [08:34:24] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [08:34:24] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [08:34:24] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [08:34:24] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [08:34:25] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet 
is currently enabled, last run 52 seconds ago with 0 failures [08:34:25] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:34:25] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [08:34:25] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [08:34:25] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:34:25] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:34:25] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:34:25] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:34:26] (03PS1) 10Yuvipanda: labs: Limit who can login via the ssh key lookup tool too [puppet] - 10https://gerrit.wikimedia.org/r/259455 [08:34:26] (03PS2) 10Yuvipanda: labs: Limit who can login via the ssh key lookup tool too [puppet] - 10https://gerrit.wikimedia.org/r/259455 [08:34:28] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [08:34:28] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [08:34:29] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [08:34:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:34:30] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:34:30] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold 
[250.0] [08:34:30] (03PS1) 10Cmjohnson: changing kafka1001 and 1002 to install jessie [puppet] - 10https://gerrit.wikimedia.org/r/259457 [08:34:30] (03CR) 10Cmjohnson: [C: 032] changing kafka1001 and 1002 to install jessie [puppet] - 10https://gerrit.wikimedia.org/r/259457 (owner: 10Cmjohnson) [08:34:33] PROBLEM - Disk space on kafka1001 is CRITICAL: Connection refused by host [08:34:33] PROBLEM - configured eth on kafka1001 is CRITICAL: Connection refused by host [08:34:33] PROBLEM - RAID on kafka1001 is CRITICAL: Connection refused by host [08:34:33] PROBLEM - dhclient process on kafka1001 is CRITICAL: Connection refused by host [08:34:33] PROBLEM - dhclient process on kafka1002 is CRITICAL: Connection refused by host [08:34:33] PROBLEM - puppet last run on kafka1001 is CRITICAL: Connection refused by host [08:34:33] PROBLEM - salt-minion processes on kafka1001 is CRITICAL: Connection refused by host [08:34:33] PROBLEM - puppet last run on kafka1002 is CRITICAL: Connection refused by host [08:34:33] PROBLEM - DPKG on kafka1002 is CRITICAL: Connection refused by host [08:34:33] PROBLEM - DPKG on kafka1001 is CRITICAL: Connection refused by host [08:34:33] PROBLEM - salt-minion processes on kafka1002 is CRITICAL: Connection refused by host [08:34:33] PROBLEM - Disk space on kafka1002 is CRITICAL: Connection refused by host [08:34:33] PROBLEM - RAID on kafka1002 is CRITICAL: Connection refused by host [08:34:33] <_joe_> cmjohnson1: I guess that's you reimaging, right? [08:34:33] PROBLEM - configured eth on kafka1002 is CRITICAL: Connection refused by host [08:34:33] _joe_ yes I am [08:34:33] _joe_: btw, the kubernetes tools (the ones that haven't opted into using NFS) are fine [08:34:33] grrrit-wm: is up, tools.wmflabs.org/nagf is up [08:34:33] <_joe_> YuviPanda: as I told you, you should market that ;) [08:34:33] yeah [08:34:34] needs more work tho. 
we only tracked down and fixed an ip-masq issue yesterday [08:34:39] (03PS3) 10Yuvipanda: labs: Limit who can login via the ssh key lookup tool too [puppet] - 10https://gerrit.wikimedia.org/r/259455 [08:34:39] (03PS1) 10Yuvipanda: [WIP] labstore: Skip activating snapshots by default [puppet] - 10https://gerrit.wikimedia.org/r/259458 [08:34:40] RECOVERY - salt-minion processes on kafka1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:34:40] RECOVERY - DPKG on kafka1002 is OK: All packages OK [08:34:40] RECOVERY - salt-minion processes on kafka1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:34:40] RECOVERY - DPKG on kafka1001 is OK: All packages OK [08:34:40] RECOVERY - Disk space on kafka1002 is OK: DISK OK [08:34:40] RECOVERY - Disk space on kafka1001 is OK: DISK OK [08:34:40] RECOVERY - configured eth on kafka1001 is OK: OK - interfaces up [08:34:41] RECOVERY - RAID on kafka1002 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [08:34:41] RECOVERY - RAID on kafka1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [08:34:41] RECOVERY - configured eth on kafka1002 is OK: OK - interfaces up [08:34:41] RECOVERY - dhclient process on kafka1001 is OK: PROCS OK: 0 processes with command name dhclient [08:34:41] RECOVERY - dhclient process on kafka1002 is OK: PROCS OK: 0 processes with command name dhclient [08:34:41] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [08:34:41] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:34:41] PROBLEM - NTP on kafka1001 is CRITICAL: NTP CRITICAL: Offset unknown [08:34:42] PROBLEM - Last backup of the others filesystem on labstore1001 is CRITICAL: Timeout while attempting connection [08:34:42] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:34:43] RECOVERY - Host 
labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [08:34:45] RECOVERY - NTP on kafka1001 is OK: NTP OK: Offset 0.01583135128 secs [08:34:45] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is OK: OK - nfs-exports is active [08:34:48] (03PS1) 10Yuvipanda: labstore: Better error-checking(?) for start-nfs [puppet] - 10https://gerrit.wikimedia.org/r/259459 [08:34:48] (03PS7) 10Mobrovac: RESTBase: Switch to service::node [puppet] - 10https://gerrit.wikimedia.org/r/257898 (https://phabricator.wikimedia.org/T118401) [08:34:49] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit nfs-exports is inactive [08:34:50] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is OK: OK - nfs-exports is active [08:38:28] (03PS1) 10Mobrovac: RESTBase: disable firejail [puppet] - 10https://gerrit.wikimedia.org/r/259460 (https://phabricator.wikimedia.org/T118401) [08:40:50] (03CR) 10Mobrovac: RESTBase: Switch to service::node (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/257898 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac) [08:41:05] <_joe_> mobrovac: need me to take a look? 
[08:49:21] (03CR) 10Muehlenhoff: [C: 031] RESTBase: disable firejail [puppet] - 10https://gerrit.wikimedia.org/r/259460 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac)
[08:50:34] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://tools-checker.wmflabs.org:80/nfs/home - 306 bytes in 0.015 second response time
[08:53:58] !log stopped nfs-kernel-server on labstore1001
[08:53:58] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR
[08:53:58] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 205, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR
[08:53:58] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:53:59] !log start nfs-kernel-server on labstore1001
[08:54:42] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.032 second response time
[08:54:51] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 959958 bytes in 4.282 second response time
[08:54:53] (03CR) 10Mobrovac: [C: 04-1] "The compiler is happy about this change - https://puppet-compiler.wmflabs.org/1492/ , but the dependent firejail change somehow fails to c" [puppet] - 10https://gerrit.wikimedia.org/r/257898 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac)
[08:55:40] hello tech-ops team! My name is Luca and I'll be part of the Analytics team in January.. I have been lurking in your irc chat for a while to get an idea of what you guys do :)
[08:55:48] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0
[08:55:57] (03CR) 10Mobrovac: [C: 04-1] "For some reason the puppet compiler fails to compile this change - https://puppet-compiler.wmflabs.org/1491/ :( Thoughts?" [puppet] - 10https://gerrit.wikimedia.org/r/259460 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac)
[08:56:19] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 207, down: 0, dormant: 0, excluded: 0, unused: 0
[08:58:34] hi elukey !
[08:58:42] o/
[08:59:59] elukey: hi Luca!
[09:00:33] elukey: you'll see there's quite the bot activity here, depending on your irc client you might want to color/hilight the bots differently
[09:02:52] godog: you're completely right, I am studying a configuration for irssi that avoids brain failure in the mornings :)
[09:52:17] (03CR) 10Alexandros Kosiaris: "Yes, failed merge for some reason" [puppet] - 10https://gerrit.wikimedia.org/r/259460 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac)
[09:52:33] (03CR) 10Tulsi Bhagat: [C: 031] Allow sysops to add and remove accounts from bot group on mai.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239854 (https://phabricator.wikimedia.org/T111898) (owner: 10Mdann52)
[10:01:52] (03CR) 10Tulsi Bhagat: [C: 031] Template editor group on hi.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258444 (https://phabricator.wikimedia.org/T120342) (owner: 10Dereckson)
[10:09:54] (03CR) 10Merlijn van Deen: [C: 04-1] "otherwise lgtm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/259455 (owner: 10Yuvipanda)
[10:10:10] (03PS2) 10Giuseppe Lavagetto: mediawiki: add conftool-specifc credentials and scripts [puppet] - 10https://gerrit.wikimedia.org/r/258979
[10:10:12] (03PS20) 10Giuseppe Lavagetto: etcd: auth
puppetization [puppet] - 10https://gerrit.wikimedia.org/r/255155 (https://phabricator.wikimedia.org/T97972) [10:10:14] (03PS3) 10Giuseppe Lavagetto: conftool: add support for ACLs, helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/258975 [10:13:26] (03PS2) 10Yuvipanda: labstore: Better error-checking(?) for start-nfs [puppet] - 10https://gerrit.wikimedia.org/r/259459 [10:13:28] (03PS2) 10Yuvipanda: labstore: Skip activating snapshots by default [puppet] - 10https://gerrit.wikimedia.org/r/259458 [10:13:30] (03PS4) 10Yuvipanda: labs: Limit who can login via the ssh key lookup tool too [puppet] - 10https://gerrit.wikimedia.org/r/259455 [10:14:02] (03CR) 10Mark Bergsma: [C: 031] labstore: Better error-checking(?) for start-nfs [puppet] - 10https://gerrit.wikimedia.org/r/259459 (owner: 10Yuvipanda) [10:14:45] (03CR) 10Mark Bergsma: [C: 031] labstore: Skip activating snapshots by default [puppet] - 10https://gerrit.wikimedia.org/r/259458 (owner: 10Yuvipanda) [10:17:46] (03PS3) 10Yuvipanda: labstore: Better error-checking(?) for start-nfs [puppet] - 10https://gerrit.wikimedia.org/r/259459 [10:17:48] (03PS3) 10Yuvipanda: labstore: Skip activating snapshots by default [puppet] - 10https://gerrit.wikimedia.org/r/259458 [10:17:50] (03PS5) 10Yuvipanda: labs: Limit who can login via the ssh key lookup tool too [puppet] - 10https://gerrit.wikimedia.org/r/259455 [10:20:02] (03PS6) 10Yuvipanda: labs: Limit who can login via the ssh key lookup tool too [puppet] - 10https://gerrit.wikimedia.org/r/259455 [10:20:36] (03CR) 10Yuvipanda: [C: 032] labstore: Skip activating snapshots by default [puppet] - 10https://gerrit.wikimedia.org/r/259458 (owner: 10Yuvipanda) [10:20:53] (03CR) 10Yuvipanda: [C: 032] labstore: Better error-checking(?) for start-nfs [puppet] - 10https://gerrit.wikimedia.org/r/259459 (owner: 10Yuvipanda) [10:22:37] (03CR) 10Merlijn van Deen: [C: 031] "tested with a few examples, seems to work!" 
[puppet] - 10https://gerrit.wikimedia.org/r/259455 (owner: 10Yuvipanda) [10:25:03] valhallasw`cloud: <3 thanks [10:25:11] valhallasw`cloud: not gonna merge now tho :D [10:25:35] 6operations, 10DBA, 7Wikimedia-log-errors: Spikes of job runner new connection errors to mysql "Error connecting to 10.64.32.24: Can't connect to MySQL server on '10.64.32.24' (4)" - mainly on db1035 - https://phabricator.wikimedia.org/T107072#1884071 (10jcrespo) 5Open>3Resolved This issue is solved, the... [10:27:39] PROBLEM - Last backup of the tools filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-tools was exit-code [10:28:35] yes yes that's fine [10:28:46] also nice error message, icinga-wm [10:35:47] (03CR) 10Alexandros Kosiaris: [C: 031] [elasticsearch] Collect cluster health stats about shard movement [puppet] - 10https://gerrit.wikimedia.org/r/259443 (https://phabricator.wikimedia.org/T117284) (owner: 10EBernhardson) [10:36:16] (03PS1) 10Faidon Liambotis: labs: add an access.conf stanza to always allow cron [puppet] - 10https://gerrit.wikimedia.org/r/259471 [10:37:16] (03CR) 10Faidon Liambotis: [C: 032] labs: add an access.conf stanza to always allow cron [puppet] - 10https://gerrit.wikimedia.org/r/259471 (owner: 10Faidon Liambotis) [10:38:37] (03PS1) 10Yuvipanda: labstore: Enable snapshots immediately after creation [puppet] - 10https://gerrit.wikimedia.org/r/259472 [10:38:43] mark: ^ [10:39:38] (03CR) 10Mobrovac: "Duh, thnx @akosiaris. Will rebase both patches." 
[puppet] - 10https://gerrit.wikimedia.org/r/259460 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac) [10:39:46] 6operations, 10DBA, 7Wikimedia-log-errors: Spikes of job runner new connection errors to mysql "Error connecting to 10.64.32.24: Can't connect to MySQL server on '10.64.32.24' (4)" - mainly on db1035 - https://phabricator.wikimedia.org/T107072#1884109 (10jcrespo) @Krinkle see T121623 [10:40:01] (03PS1) 10ArielGlenn: allow salt master to handle more than 1024 conns [puppet] - 10https://gerrit.wikimedia.org/r/259473 [10:41:45] (03CR) 10ArielGlenn: [C: 032] allow salt master to handle more than 1024 conns [puppet] - 10https://gerrit.wikimedia.org/r/259473 (owner: 10ArielGlenn) [10:43:39] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [10:44:28] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 205, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [10:45:28] (03PS1) 10ArielGlenn: remove ariel's non-yubi ssh key [puppet] - 10https://gerrit.wikimedia.org/r/259474 [10:45:55] (03PS2) 10Mobrovac: RESTBase: disable firejail [puppet] - 10https://gerrit.wikimedia.org/r/259460 (https://phabricator.wikimedia.org/T118401) [10:45:57] (03PS8) 10Mobrovac: RESTBase: Switch to service::node [puppet] - 10https://gerrit.wikimedia.org/r/257898 (https://phabricator.wikimedia.org/T118401) [10:46:57] mark: ok, I'm going to merge and see what happens :) [10:47:09] (03PS2) 10Yuvipanda: labstore: Enable snapshots immediately after creation [puppet] - 10https://gerrit.wikimedia.org/r/259472 [10:47:11] (03CR) 10ArielGlenn: [C: 032] remove ariel's non-yubi ssh key [puppet] - 10https://gerrit.wikimedia.org/r/259474 (owner: 10ArielGlenn) [10:47:27] 
(03PS3) 10Yuvipanda: labstore: Enable snapshots immediately after creation [puppet] - 10https://gerrit.wikimedia.org/r/259472 [10:47:35] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Enable snapshots immediately after creation [puppet] - 10https://gerrit.wikimedia.org/r/259472 (owner: 10Yuvipanda) [10:49:31] apergos: I merged yours too [10:49:50] RECOVERY - Last backup of the tools filesystem on labstore1001 is OK: OK - Last run for unit replicate-tools was successful [10:49:53] YuviPanda: I saw, when trying to merge, it offered me yours but when I hit return it said there was nothing to do [10:49:57] thanks [10:52:28] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 207, down: 0, dormant: 0, excluded: 0, unused: 0 [10:53:39] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 [10:59:18] (03PS1) 10Yuvipanda: labstore: Activate volumes before mounting them [puppet] - 10https://gerrit.wikimedia.org/r/259475 [10:59:33] mark: ^ is the start-nfs change [10:59:43] actually I should probably put a sync-exports call there too [11:00:01] and perhaps a FIXME to look at alternative ways of doing that [11:00:05] possibly involving lvm.conf [11:00:12] or explicitly listing snapshots [11:01:21] mark: ok [11:02:48] (03PS2) 10Yuvipanda: labstore: Activate volumes before mounting them [puppet] - 10https://gerrit.wikimedia.org/r/259475 [11:03:39] (03CR) 10Alexandros Kosiaris: [C: 04-1] "A couple of points, the premise looks good and solves an actual problem we got right now." 
[puppet] - 10https://gerrit.wikimedia.org/r/259319 (owner: 10Dzahn) [11:03:41] (03PS1) 10Yuvipanda: labstore: Run sync-exports in start-nfs too [puppet] - 10https://gerrit.wikimedia.org/r/259476 [11:04:06] mark: ^ two patches now (one for activation, one for sync-exports which bit us) [11:05:12] (03CR) 10Mark Bergsma: [C: 031] labstore: Run sync-exports in start-nfs too [puppet] - 10https://gerrit.wikimedia.org/r/259476 (owner: 10Yuvipanda) [11:06:11] (03CR) 10Mark Bergsma: [C: 04-1] labstore: Activate volumes before mounting them (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/259475 (owner: 10Yuvipanda) [11:09:54] (03PS2) 10Yuvipanda: labstore: Run sync-exports in start-nfs too [puppet] - 10https://gerrit.wikimedia.org/r/259476 [11:09:56] (03PS3) 10Yuvipanda: labstore: Activate volumes before mounting them [puppet] - 10https://gerrit.wikimedia.org/r/259475 [11:10:08] mark: updated [11:11:14] (03CR) 10Mark Bergsma: [C: 031] labstore: Activate volumes before mounting them [puppet] - 10https://gerrit.wikimedia.org/r/259475 (owner: 10Yuvipanda) [11:12:10] (03CR) 10Yuvipanda: [C: 032] labstore: Activate volumes before mounting them [puppet] - 10https://gerrit.wikimedia.org/r/259475 (owner: 10Yuvipanda) [11:12:24] (03CR) 10Yuvipanda: [C: 032] labstore: Run sync-exports in start-nfs too [puppet] - 10https://gerrit.wikimedia.org/r/259476 (owner: 10Yuvipanda) [11:15:06] 6operations, 6Labs: Investigate better way of deferring activation of Labs LVM volumes (and corresponding snapshots) until after system boot - https://phabricator.wikimedia.org/T121629#1884200 (10yuvipanda) [11:26:48] (03PS9) 10Mobrovac: RESTBase: Switch to service::node [puppet] - 10https://gerrit.wikimedia.org/r/257898 (https://phabricator.wikimedia.org/T118401) [11:29:39] (03PS3) 10Mobrovac: RESTBase: disable firejail [puppet] - 10https://gerrit.wikimedia.org/r/259460 (https://phabricator.wikimedia.org/T118401) [11:30:59] PROBLEM - Unmerged changes on repository puppet on palladium is
CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [11:31:19] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [11:34:37] akosiaris: _joe_: i really don't understand what's going on for https://puppet-compiler.wmflabs.org/1496/ [11:34:37] !log merged YuviPanda: labstore: Run sync-exports in start-nfs too (d5273d9) and YuviPanda: labstore: Activate volumes before mounting them (dff32de) on palladium [11:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:35:00] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [11:35:18] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [11:40:01] (03CR) 10Mobrovac: "This is still failing - https://puppet-compiler.wmflabs.org/1496/ while the other patch seems to compile just fine - https://puppet-compil" [puppet] - 10https://gerrit.wikimedia.org/r/259460 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac) [11:43:25] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1884252 (10akosiaris) >>! In T120281#1880860, @EBernhardson wrote: > Data would move in both directions. > > The two l... 
[11:49:20] PROBLEM - Disk space on restbase1004 is CRITICAL: DISK CRITICAL - free space: /var 105602 MB (3% inode=99%) [11:49:22] !log restarting and reconfiguring mysql on db2023 [11:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:59:08] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:03:31] (03PS1) 10Jcrespo: Reconfiguring mysqls db1022 and s6 codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/259480 [12:03:53] (03CR) 10Muehlenhoff: "Better add your key in parallel for the testing period, see e.g. 8ad4e24e6a91af737054a70beaaf84864c967199 for the format in the YAML file." [puppet] - 10https://gerrit.wikimedia.org/r/253905 (owner: 10Jcrespo) [12:04:39] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:12:30] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:13:10] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:15:32] (03CR) 10Jcrespo: "True." [puppet] - 10https://gerrit.wikimedia.org/r/253905 (owner: 10Jcrespo) [12:18:01] hmm. what exactly does the nightly 'sync-l10n' script do? [12:19:28] 6operations, 10Dumps-Generation, 10hardware-requests: determine hardware needs for dumps in eqiad (boxes out of warranty, capacity planning) - https://phabricator.wikimedia.org/T118154#1884293 (10mark) a:5RobH>3None Alright, let's get a quote for 3 new boxes then. 
[12:23:53] (03PS1) 10Jcrespo: Depool db1022 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259482 [12:24:29] (03CR) 10Jcrespo: [C: 032] Depool db1022 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259482 (owner: 10Jcrespo) [12:26:00] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1022 for maintenance (duration: 00m 37s) [12:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:29:51] !log restarting and reconfiguring mysql at db1022 [12:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:31:44] (03CR) 10Jcrespo: [C: 032] Reconfiguring mysqls db1022 and s6 codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/259480 (owner: 10Jcrespo) [12:43:54] Geez. Where did Phab go? 503. [12:44:03] Ah. Intermittent. [12:47:54] jynus, hi! do you have 3 mins for a question? [12:48:23] mforns, wait 1 sec [12:48:28] aha [12:50:41] https://phabricator.wikimedia.org/P2427 [12:51:06] tell me, mforns [12:51:36] hey jynus, in analytics we are discussing large tables in m4-master being too larg [12:51:39] *large [12:51:47] yes [12:51:56] I mentioned partitioning [12:52:09] that would help with maintenance problems [12:52:15] and I remembered, but was not sure, that when implementing auto-purging... [12:52:33] I thought: 1) we would implement auto-purging in the replicas [12:52:49] 2) all tables in m4-master would be fully purged after 90 days [12:52:55] andre__, https://phabricator.wikimedia.org/T112776#1884320 [12:53:06] so historical data would be kept in the replicas only? [12:53:22] jynus, I know. :) [12:53:27] data on the master is now deleted after 45 days [12:53:27] thanks [12:53:28] and I wanted to know if that is actually so, or if I was wrong [12:53:42] jynus, all tables? [12:53:49] on the master [12:53:52] aha [12:54:07] it was that, or running out of space [12:54:17] so the problem with large tables is not in the master?
[12:54:29] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [12:54:31] the problem with large tables is everywhere [12:54:35] hehe [12:55:11] even with 45 days worth of data, m4-master tables are too large? [12:55:12] purging and partitioning are different discussions, but both are needed [12:55:25] purging is required to save space [12:55:31] ok [12:55:59] but saving space requires defragmenting or converting to TokuDB, and those require having different tables [12:56:11] I see [12:56:12] also, large tables create other problems [12:56:32] if you only do one of those, you do not solve all problems [12:56:56] the slaves currently do not have a lack of space problem (for now) [12:57:08] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [12:57:12] but they would benefit from smaller tables in any case [12:58:18] jynus, the thing is: analytics is considering deleting the Echo table to avoid problems, and I was wondering if this will help, or it does not matter [12:58:34] deleting tables always helps [12:58:34] there are people interested in Echo table contents [12:58:37] aha [12:58:47] we could keep them on the slaves only [12:59:03] (but they could not be updated there) [12:59:10] but you said all tables in m4-master only hold the last 45 days of data, is that right? [12:59:37] yes, by that time, they are considered already copied to the slaves [13:00:14] jynus, ok cool, thanks for the explanations!
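The partition-plus-purge scheme jynus describes can be sketched in SQL: with RANGE partitions on the timestamp, the 45-day purge on the master becomes a metadata-only DROP PARTITION rather than a long, fragmenting DELETE. A hypothetical sketch; the table and column names are illustrative, not the real m4 schema:

```sql
-- Hypothetical event table; names are illustrative, not the real m4 schema.
CREATE TABLE events (
    id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    ts      DATETIME NOT NULL,
    payload BLOB,
    PRIMARY KEY (id, ts)   -- the partitioning column must be in every unique key
) ENGINE=TokuDB
PARTITION BY RANGE (TO_DAYS(ts)) (
    PARTITION p201510 VALUES LESS THAN (TO_DAYS('2015-11-01')),
    PARTITION p201511 VALUES LESS THAN (TO_DAYS('2015-12-01')),
    PARTITION p201512 VALUES LESS THAN (TO_DAYS('2016-01-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);

-- The periodic purge is then a metadata operation that frees space at once,
-- instead of a DELETE that leaves the table fragmented:
ALTER TABLE events DROP PARTITION p201510;
```

This is also why partitioning helps the TokuDB conversion jynus mentions: the table can be rebuilt partition by partition instead of stopping writes for one long ALTER.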
[13:00:17] :] [13:00:28] the purge does not affect the slaves, yet [13:00:43] aha, I imagined that, thx [13:00:55] the main help would be converting all to tokudb, as said on the ticket [13:01:13] that requires either partitioning OR stopping writes [13:01:24] that is why I prefer partitioning [13:03:01] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:03:43] I think everything is good now, gone to lunch [13:04:20] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:07:09] jynus, how long would writes need to be stopped to convert tables to tokudb? [13:21:54] mobrovac: ping me when around re: restbase and service::node [13:23:02] godog: i'm here, but my vps seems to have conn problems :( [13:24:04] godog: we can start whenever you're ready [13:26:15] <_joe_> I'm at lunch, will be back later [13:28:51] mobrovac: ok! so disabling puppet first on restbase/aqs/sca/scb and merge https://gerrit.wikimedia.org/r/#/c/257898/ plus https://gerrit.wikimedia.org/r/#/c/259460 [13:29:55] godog: the path i'm thinking we should take is: (a) disable puppet in rb-staging aqs1* restbase[12]* sc[ab]* ; (b) merge https://gerrit.wikimedia.org/r/#/c/257898 ; (c) test https://gerrit.wikimedia.org/r/#/c/259460/ with pcc; (d) merge that; (e) test it in staging [13:30:50] (f) enable on sc[ab]* ; (g) enable on rb1001 ... [13:31:24] mobrovac: LGTM [13:34:00] godog: moritzm: akosiaris: note re firejail and restbase, the firejail pkg is not installed on rb[12]*, so we shouldn't forget to install it before enabling it [13:34:05] (not today though) [13:34:42] (03PS2) 10Alexandros Kosiaris: diamond: Add openldap collector [puppet] - 10https://gerrit.wikimedia.org/r/258491 [13:35:07] mobrovac: service::node installs it, doesn't it?
[13:35:17] * mobrovac checking [13:35:31] !log disable puppet on deployment_target:restbase/deploy and sca/scb [13:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:35:52] moritzm: indeed it does, sorry for the false alarm [13:36:16] godog: deployment_target currently includes aqs too right? [13:36:40] mobrovac: that's correct [13:36:44] kk [13:37:09] good, but that's just plain wrong :P [13:37:45] hehe depends on your point of view I guess, restbase-the-service vs restbase-the-framework [13:37:59] anyways I'll merge the service::node [13:38:14] ok [13:38:14] (03PS10) 10Filippo Giunchedi: RESTBase: Switch to service::node [puppet] - 10https://gerrit.wikimedia.org/r/257898 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac) [13:38:20] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] RESTBase: Switch to service::node [puppet] - 10https://gerrit.wikimedia.org/r/257898 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac) [13:39:19] and testing the other change with pcc [13:40:41] ah you'll do it, cool, thnx [13:41:19] heh urandom seems to be having the same vps conn problems as me [13:44:23] (03PS4) 10Filippo Giunchedi: RESTBase: disable firejail [puppet] - 10https://gerrit.wikimedia.org/r/259460 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac) [13:44:32] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "LGTM https://puppet-compiler.wmflabs.org/1499/" [puppet] - 10https://gerrit.wikimedia.org/r/259460 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac) [13:44:50] godog: \o/ [13:45:02] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL: CRITICAL: 10.53% of data above the critical threshold [100000000.0] [13:45:10] hehe so I'll start with restbase-test and friends [13:45:35] I can help with the sc[ab]* crowd [13:45:44] I expect the change to be a noop anyway [13:45:45] ;-) [13:46:22] hehe thanks akosiaris, yeah me too [13:46:51] !log depooling sca1002, scb1002 for citoid, 
cxserver, graphoid, mathoid [13:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:47:02] RECOVERY - Outgoing network saturation on labstore1001 is OK: OK: Less than 10.00% above the threshold [75000000.0] [13:47:33] akosiaris: ahem I was too trigger happy and ran puppet on sca1001 just now [13:47:42] !log (retroactive) enable puppet on sca1001 [13:47:44] lol [13:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:47:52] anything bad ? [13:47:59] or are we ok ? [13:48:28] we're ok, the change is minimal but refreshed the affected services [13:48:46] with sca1002 depooled - ouch [13:48:53] ok, lemme see what happened cause those were probably the only ones around [13:49:14] sorry about that heh :| [13:50:26] !log enable puppet on restbase-test2001 and bounce restbase [13:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:50:42] "apiRetError":{"code":"invalidhash","info":"No graph found.","docref":"See https://it.wikipedia.org/w/api.php for API usage"},"levelPath":"info/mwapi-error" ... [13:50:45] sigh [13:50:57] mobrovac should be happy to see this :P [13:51:04] akosiaris: yeah, that's yurik ^^ [13:51:08] akosiaris: thrilled :) [13:51:26] mobrovac, wa? 
[13:51:58] akosiaris, i'm updating many existing graphs, so graphoid obviously fails to get the older graphs [13:52:06] yurik: see the mail on ops-l that says to look at this channel :P [13:52:07] if the html was cached on the client [13:52:24] akosiaris: I'll hold off touching sc[ab] btw and do aqs/restbase to avoid repeating that again [13:52:33] mobrovac, i'm looking )) [13:52:34] godog: ok [13:52:44] yurik: the problem is in graphoid's spec which requests the graphs with the old ids (before you changed them manually) [13:53:24] mobrovac: mathoid is also complaining about message":"400: bad_request","stack":"HTTPError: 400: bad_request\n at emitError (/srv/deployment/mathoid/deploy/src/routes/mathoid.js:38:11 and so on but that happens before the change anyway, so irrelevant for this, but I 'll file a task [13:53:28] unless one already exists [13:53:41] akosiaris: no no [13:53:56] mobrovac, the fundamental problem is that we cannot regenerate a graph if we only know its hash, because page save deletes the old version from the database [13:54:01] akosiaris: that's expected, there's a test in mathoid's spec to test for bad reqs / invalid formulae [13:54:34] akosiaris: we do have a problem with mathoid on sca, but let's discuss it later, after we finish this [13:54:42] mobrovac: well, 400 bad request and a stack trace does not exactly sound like the best answer for such a test [13:54:49] mobrovac: ok [13:54:54] so the $1m question is how do we store all previous graphs, at least until they expire from cache [13:55:24] yurik: that really isn't relevant for the current monitoring failures [13:55:24] !log restarting HHVM on canary appservers to effect various security updates (libxml, openssl, gs, freetype) [13:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:55:32] !log enable puppet on cerium and bounce restbase [13:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:30] !log 
repool sca1002 for citoid, mathoid, graphoid, cxserver [13:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:47] btw, cxserver depooling was not needed. I just got carried away [13:59:43] mobrovac, re logs, can't we simply change the hash in your test? I'm sorry I didn't realize you were testing using that hash [13:59:47] (03CR) 10Muehlenhoff: [C: 031] diamond: Add openldap collector [puppet] - 10https://gerrit.wikimedia.org/r/258491 (owner: 10Alexandros Kosiaris) [13:59:48] !log depool scb1001 for mobileapps (oid) [13:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:00:52] yurik: you are the maintainer of graphoid, which happens to include a monitoring spec used in prod, it has been requested that you fix it so please do it [14:01:19] mobrovac, i thought you gave your own graph for it? checking [14:02:22] yurik: also, as i said in the mail, doing such changes requires coordination, doing it on your own wastes everybody's time (including your own) [14:02:26] oh well [14:02:40] as the Germans would say, so typisch [14:03:58] mobrovac: cerium LGTM, the logs show messages from the cassandra driver a while ago but afaict those are the hosts pre-multi instance [14:04:36] godog: lemme do some sample queries there and run the monitoring script [14:04:37] !log repool scb1001 for mobileapps [14:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:05:01] godog: I am done on my part. It was a noop indeed. need help with anything else ? [14:05:30] godog: looking good [14:05:41] akosiaris: so sc[ab]* all good? [14:05:42] nice! [14:05:52] mobrovac: affirmative [14:05:53] akosiaris: thanks! nah I'll finish up aqs/restbase and that's it [14:06:07] akosiaris: thnx!
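yurik's question above (how to keep serving old graphs by hash after a page save has replaced them) amounts to a content-addressed store whose entries must outlive any cached HTML that embeds the hash. A minimal hypothetical sketch of that idea, not graphoid's actual code:

```python
import hashlib
import json
import time


class GraphStore:
    """Content-addressed graph store: every stored spec stays retrievable
    by its hash for at least as long as cached HTML may reference it."""

    def __init__(self, html_cache_ttl):
        self.ttl = html_cache_ttl   # seconds; must be >= the HTML cache TTL
        self._graphs = {}           # hash -> (spec, stored_at)

    def put(self, spec):
        """Store a graph spec and return the hash a page would embed."""
        key = hashlib.sha1(
            json.dumps(spec, sort_keys=True).encode()).hexdigest()
        self._graphs[key] = (spec, time.time())
        return key

    def get(self, key):
        """Fetch a spec by hash; superseded versions remain until expiry."""
        entry = self._graphs.get(key)
        if entry is None:
            return None
        spec, stored_at = entry
        if time.time() - stored_at > self.ttl:
            del self._graphs[key]   # the HTML referencing it has expired too
            return None
        return spec


store = GraphStore(html_cache_ttl=86400)
h1 = store.put({"mark": "bar", "data": [1, 2]})
store.put({"mark": "line", "data": [1, 2]})  # a "page save" adds a new version
assert store.get(h1) == {"mark": "bar", "data": [1, 2]}  # old hash still resolves
```

The point is that put() never overwrites by page title, only adds by content hash, so a save cannot orphan a hash that cached HTML still embeds.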
[14:06:10] so https://grafana-admin.wikimedia.org/dashboard/db/graphoid [14:06:23] there's an empty panel down at the bottom [14:06:30] godog: i think we can enable puppet in all of staging at this point [14:06:34] not sure why I even bother ... [14:06:40] :) [14:06:47] !log delete empty panel in https://grafana-admin.wikimedia.org/dashboard/db/graphoid [14:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:06:55] akosiaris: iirc, that was due to a grafana bug [14:07:11] ah the mess before grafana 2 ? [14:07:17] yeah those were nasty [14:07:33] but I met that bug too and made a conscious effort to clean up afterwards [14:08:54] !log reenable puppet on restbase test cluster [14:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:15:39] mobrovac: I'm reenabling puppet everywhere and running it, do you mind rolling restart restbase afterwards? should be ok [14:16:02] godog: i'd prefer we first do only rb1001 and then mass-reenable [14:16:17] sure, that works too [14:16:40] k, lemme do it [14:17:04] !log restbase reenabling and running puppet on canary rb1001 [14:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:17:39] godog: btw, does puppet automatically reload systemd after changing a unit service file?
[14:18:13] mobrovac: yes [14:18:17] kk [14:18:42] <_joe_> mobrovac: only if you use base::system_unit though [14:19:26] (03CR) 10Filippo Giunchedi: [C: 031] diamond: Add openldap collector [puppet] - 10https://gerrit.wikimedia.org/r/258491 (owner: 10Alexandros Kosiaris) [14:19:29] _joe_: we're smart enough to use it :P [14:20:12] godog: hmm, just noticed in the systemd config: [14:20:19] +Restart=always [14:20:19] +RestartSec=2s [14:20:58] * mobrovac checking in ops/puppet [14:21:52] indeedly [14:21:55] godog: that's hardcoded in the service::node systemd template, i'll need to make an extra patch making it conditional [14:22:05] coming right up [14:22:52] I know I sound like a broken record, but the real solution there is make restbase dumber on startup [14:23:23] <_joe_> "dumber" [14:24:57] (03PS3) 10Giuseppe Lavagetto: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/258981 [14:24:59] (03PS1) 10Giuseppe Lavagetto: Made locking optional as it might slow down syncing significantly [software/conftool] - 10https://gerrit.wikimedia.org/r/259492 [14:25:05] (03PS1) 10Mobrovac: service::node: Do not force a service restart if auto_refresh == false [puppet] - 10https://gerrit.wikimedia.org/r/259493 [14:26:57] mobrovac, do i need to deploy graphoid to fix the test? [14:27:28] (03PS2) 10Mobrovac: service::node: Do not force a service restart if auto_refresh == false [puppet] - 10https://gerrit.wikimedia.org/r/259493 [14:28:37] yurik: yup [14:30:13] (03CR) 10Mobrovac: "https://puppet-compiler.wmflabs.org/1501/ confirms it works as it's supposed to." 
[puppet] - 10https://gerrit.wikimedia.org/r/259493 (owner: 10Mobrovac) [14:30:20] godog: ^^^ [14:31:00] (03CR) 10Filippo Giunchedi: [C: 04-1] "I wouldn't expect auto_refresh to affect a service auto restart behaviour, perhaps an additional parameter would be more obvious" [puppet] - 10https://gerrit.wikimedia.org/r/259493 (owner: 10Mobrovac) [14:31:09] aye [14:32:00] godog: should we simply rename that one to service_restart and have it include all of the possible restarts? [14:33:56] mobrovac: in my mind those are two different behaviours, ie. puppet behaviour vs systemd/upstart behaviour [14:33:57] 6operations, 10ops-eqiad, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka1001 & kafka1002 - https://phabricator.wikimedia.org/T121553#1884447 (10Ottomata) Awesome, thank you! [14:34:25] godog: k [14:40:12] (03PS3) 10Mobrovac: service::node: Configure automatic service restarts with init_restart [puppet] - 10https://gerrit.wikimedia.org/r/259493 [14:41:09] !log starting `nodetool cleanup' on restbase1006.eqiad (https://phabricator.wikimedia.org/T121535) [14:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:41:57] 6operations, 10Wikimedia-Mailing-lists: Create a new mailing list: elf@lists.wikimedia.org - https://phabricator.wikimedia.org/T120523#1884459 (10Fourandsixty) Thank you, Dzahn, much appreciated. And thank you to everyone who commented. [14:42:11] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1884460 (10Eevans) [14:42:54] (03CR) 10Mobrovac: "Introduced a new param - init_restart. Compiling ok - https://puppet-compiler.wmflabs.org/1502/ ." 
[puppet] - 10https://gerrit.wikimedia.org/r/259493 (owner: 10Mobrovac) [14:43:07] godog: ^^^ [14:44:05] (03CR) 10Hashar: "Thank you Andrew for the rebase / beta cluster update :-}" [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [14:45:05] (03CR) 10Faidon Liambotis: "Does this need to be statically assigned?" [puppet] - 10https://gerrit.wikimedia.org/r/259441 (https://phabricator.wikimedia.org/T120585) (owner: 10Dzahn) [14:45:31] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/259493 (owner: 10Mobrovac) [14:46:14] (03PS4) 10Muehlenhoff: Set idle_timelimit for nslcd [puppet] - 10https://gerrit.wikimedia.org/r/259256 [14:46:33] !log starting `nodetool cleanup' on restbase1009-a.eqiad (https://phabricator.wikimedia.org/T121535) [14:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:16] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1884464 (10Eevans) [14:47:37] (03CR) 10Muehlenhoff: [C: 032 V: 032] Set idle_timelimit for nslcd [puppet] - 10https://gerrit.wikimedia.org/r/259256 (owner: 10Muehlenhoff) [14:49:37] !log restbase restart rb1001 [14:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:49:56] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1884482 (10Eevans) [14:50:46] godog: looking good! [14:50:59] godog: enable puppet on all of them and i'll roll-restart? 
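The Restart=always / RestartSec=2s lines mobrovac quoted earlier live in the [Service] section of a systemd unit. A hypothetical sketch; only the two Restart lines come from the log, the unit name and ExecStart path are illustrative:

```ini
# Hypothetical service::node-style unit sketch; only Restart= and RestartSec=
# are taken from the discussion above, everything else is illustrative.
[Unit]
Description=restbase service

[Service]
ExecStart=/usr/bin/nodejs /srv/deployment/restbase/deploy/restbase/server.js
# What the template hardcoded before the init_restart parameter made it
# conditional: restart the service 2 seconds after any exit, clean or not.
Restart=always
RestartSec=2s

[Install]
WantedBy=multi-user.target
```

After editing such a unit on disk, `systemctl daemon-reload` is needed before the change takes effect, which is the reload behaviour asked about above.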
[14:51:19] (03CR) 10Faidon Liambotis: [C: 04-1] diamond: Add openldap collector (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/258491 (owner: 10Alexandros Kosiaris) [14:51:53] mobrovac: yup, LGTM [14:52:39] !log reenable and run puppet on restbase* and aqs* [14:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:55:40] mobrovac: all done [14:55:52] kk thnx godog! [14:56:04] (03PS7) 10Ottomata: Using more generic roles for kafka classes, configuring new main brokers kafka[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/258220 (https://phabricator.wikimedia.org/T120957) [14:56:55] !log restbase roll-restarting restbase [14:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:59:28] (03PS8) 10Ottomata: Using more generic roles for kafka classes, configuring new main brokers kafka[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/258220 (https://phabricator.wikimedia.org/T120957) [15:01:26] (03PS2) 10Muehlenhoff: Uninstall ecryptfs-utils [puppet] - 10https://gerrit.wikimedia.org/r/256650 [15:04:06] !log setting db2023 master to be now db1049 instead of m5-master [15:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:05:19] (03CR) 10Ottomata: [C: 032] Using more generic roles for kafka classes, configuring new main brokers kafka[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/258220 (https://phabricator.wikimedia.org/T120957) (owner: 10Ottomata) [15:05:23] godog: all looking good! :) [15:06:42] mobrovac: sweet! 
[15:08:41] (03PS1) 10Ottomata: Fix undefined variable access [puppet] - 10https://gerrit.wikimedia.org/r/259502 [15:08:57] (03PS2) 10Ottomata: Fix undefined variable access [puppet] - 10https://gerrit.wikimedia.org/r/259502 [15:09:18] (03CR) 10Ottomata: [C: 032 V: 032] Fix undefined variable access [puppet] - 10https://gerrit.wikimedia.org/r/259502 (owner: 10Ottomata) [15:09:40] !log roll-restart swift daemons on ms-be1* [15:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:38] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: puppet fail [15:10:39] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: puppet fail [15:11:11] (03PS1) 10Ottomata: Fixing broker names for main-codfw cluster config [puppet] - 10https://gerrit.wikimedia.org/r/259503 [15:11:23] (03CR) 10Ottomata: [C: 032 V: 032] Fixing broker names for main-codfw cluster config [puppet] - 10https://gerrit.wikimedia.org/r/259503 (owner: 10Ottomata) [15:11:48] (03PS1) 10Ottomata: Fixing one more broker name in codfw [puppet] - 10https://gerrit.wikimedia.org/r/259504 [15:12:05] (03CR) 10Ottomata: [C: 032 V: 032] Fixing one more broker name in codfw [puppet] - 10https://gerrit.wikimedia.org/r/259504 (owner: 10Ottomata) [15:12:19] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: puppet fail [15:12:29] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 4 failures [15:13:52] (03PS1) 10Ottomata: Set up proper kafka log dirs for main clusters [puppet] - 10https://gerrit.wikimedia.org/r/259506 [15:14:52] (03PS2) 10Ottomata: Set up proper kafka log dirs for main clusters [puppet] - 10https://gerrit.wikimedia.org/r/259506 [15:14:55] (03CR) 10Muehlenhoff: [C: 032 V: 032] Uninstall ecryptfs-utils [puppet] - 10https://gerrit.wikimedia.org/r/256650 (owner: 10Muehlenhoff) [15:15:27] (03PS3) 10Muehlenhoff: Uninstall ecryptfs-utils [puppet] - 10https://gerrit.wikimedia.org/r/256650 [15:15:34] (03CR) 10Muehlenhoff: [V: 032] 
Uninstall ecryptfs-utils [puppet] - 10https://gerrit.wikimedia.org/r/256650 (owner: 10Muehlenhoff) [15:16:28] RECOVERY - graphoid endpoints health on sca1001 is OK: All endpoints are healthy [15:16:30] (03PS3) 10Ottomata: Set up proper kafka log dirs for main clusters [puppet] - 10https://gerrit.wikimedia.org/r/259506 [15:16:34] !log removed ecryptfs-utils across the cluster [15:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:17:08] RECOVERY - graphoid endpoints health on sca1002 is OK: All endpoints are healthy [15:17:47] !log updated graphoid service [15:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:31] (03PS4) 10Ottomata: Set up proper kafka log dirs for main clusters [puppet] - 10https://gerrit.wikimedia.org/r/259506 [15:21:16] 6operations, 10RESTBase, 7Graphite, 7service-runner: restbase should send metrics in batches - https://phabricator.wikimedia.org/T121231#1884516 (10Pchelolo) a:3Pchelolo [15:22:11] (03CR) 10Ottomata: [C: 032] Set up proper kafka log dirs for main clusters [puppet] - 10https://gerrit.wikimedia.org/r/259506 (owner: 10Ottomata) [15:23:25] (03CR) 10Dzahn: ""is_test_machine is not a very nice name. It is not descriptive of what it does"" [puppet] - 10https://gerrit.wikimedia.org/r/259319 (owner: 10Dzahn) [15:24:27] so, i have 2 approaches to just stop the test machine from paging us [15:24:34] one has one +1 and one -1 [15:24:44] so i made a new patch to reply to the comment there [15:24:50] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:25:10] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1884522 (10Eevans) There isn't //much// space to be gained from clearing old snapshots, but the ones I have looked at appear to be quite old and AFIAK, Not... 
[15:25:15] 6operations, 10RESTBase, 6Services, 5Patch-For-Review: Switch RESTBase to use service::node - https://phabricator.wikimedia.org/T118401#1884524 (10mobrovac) 5Open>3Resolved Merged && deployed, resolving. [15:26:49] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:26:52] (03CR) 10Andrew Bogott: "It is descriptive, but it could be proscriptive: 'suppress_paging' would actively describe what the flag does. As long as this flag onl" [puppet] - 10https://gerrit.wikimedia.org/r/259319 (owner: 10Dzahn) [15:26:56] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1884526 (10mobrovac) >>! In T121535#1884522, @Eevans wrote: > There isn't //much// space to be gained from clearing old snapshots, but the ones I have look... [15:27:35] (03CR) 10Dzahn: "it was on purpose to call it that. the "test machine means don't page" thing is just a consequence of it being marked as a test machine. t" [puppet] - 10https://gerrit.wikimedia.org/r/259319 (owner: 10Dzahn) [15:28:21] !log restarting and reconfiguring mysql on db2028 [15:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:29:00] akosiaris: funny, i could have sworn if i did "disable_paging" somebody would have said "but there might be more things than just paging that change if it's test" [15:29:14] (03CR) 10Andrew Bogott: "yeah, that's fine with me -- a general descriptive flag seems good if there's potential to use it elsewhere."
[puppet] - 10https://gerrit.wikimedia.org/r/259319 (owner: 10Dzahn) [15:29:32] (03PS1) 10Ottomata: Set group_prefix with Kafka cluster_name for jmx metrics [puppet] - 10https://gerrit.wikimedia.org/r/259512 [15:35:00] (03PS2) 10Ottomata: Set group_prefix with Kafka cluster_name for jmx metrics, make some variables local [puppet] - 10https://gerrit.wikimedia.org/r/259512 [15:35:59] (03CR) 10jenkins-bot: [V: 04-1] Set group_prefix with Kafka cluster_name for jmx metrics, make some variables local [puppet] - 10https://gerrit.wikimedia.org/r/259512 (owner: 10Ottomata) [15:36:26] mutante: yeah, I figured that. Then again, that approach hasn't worked very well, has it? let's go for the "one thing for a job" approach and worst case scenario we create one that wraps around many [15:36:38] PROBLEM - Kafka Broker Server on kafka2002 is CRITICAL: NRPE: Command check_kafka not defined [15:36:39] mutante: somehow I am pretty sure none will show up [15:36:53] (03PS3) 10Ottomata: Set group_prefix with Kafka cluster_name for jmx metrics, make some variables local [puppet] - 10https://gerrit.wikimedia.org/r/259512 [15:37:00] ha! an alert! [15:37:05] <_joe_> ottomata: why is this paging? [15:37:20] i should have marked for downtime [15:37:22] PROBLEM - puppet last run on mw2028 is CRITICAL: CRITICAL: Puppet has 1 failures [15:37:27] they are new production kafka brokers [15:37:42] akosiaris: ok! it's easy to convince me. amending [15:37:43] kafka pages as well ?
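The naming back-and-forth above (a descriptive `is_test_machine` versus a behavioural `suppress_paging`) is about whether one flag should describe the host or prescribe a single behaviour. A toy sketch of the "one thing for a job" approach akosiaris suggests, where a broad convenience wrapper can set several narrow flags later if ever needed; the flag names here are illustrative, not the actual Hiera/puppet keys:

```python
# Toy model of the review discussion: narrow, behavioural flags
# ("one thing for a job"), with an optional wrapper for test boxes.
# Flag names are illustrative, not the real Hiera/puppet keys.

DEFAULTS = {"suppress_paging": False}

def should_page(host_flags):
    """Whether a CRITICAL on this host should page anyone."""
    flags = {**DEFAULTS, **host_flags}
    return not flags["suppress_paging"]

def test_machine_flags():
    """Convenience wrapper: everything a test machine should suppress."""
    return {"suppress_paging": True}

print(should_page({}))                    # True: normal production host
print(should_page(test_machine_flags()))  # False: test hosts never page
```

Each narrow flag keeps a single, testable meaning, and the wrapper can grow without redefining any of them.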
[15:37:45] the codfw ones are down, i actually should just un puppetize them [15:37:54] akosiaris: for a downed broker, yes [15:38:08] PROBLEM - Kafka Broker Server on kafka2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [15:38:21] sorry [15:39:33] (03CR) 10Ottomata: [C: 032] Set group_prefix with Kafka cluster_name for jmx metrics, make some variables local [puppet] - 10https://gerrit.wikimedia.org/r/259512 (owner: 10Ottomata) [15:39:58] (03PS1) 10Ottomata: Don't puppetize codfw main kafka cluster yet, we need a zookeeper cluster [puppet] - 10https://gerrit.wikimedia.org/r/259514 [15:40:18] (03CR) 10Ottomata: [C: 032 V: 032] Don't puppetize codfw main kafka cluster yet, we need a zookeeper cluster [puppet] - 10https://gerrit.wikimedia.org/r/259514 (owner: 10Ottomata) [15:41:38] removing puppetization and alerts for the 2 kafka codfw hosts [15:42:15] ottomata: it will also need puppetstoredconfigclean.rb to make them disappear [15:43:41] yeah i did [15:43:49] and am running puppet on neon now [15:44:13] cool [15:44:32] (03PS1) 10Ottomata: Fix group prefix for jmx metrics [puppet] - 10https://gerrit.wikimedia.org/r/259517 [15:44:48] (03CR) 10Ottomata: [C: 032 V: 032] Fix group prefix for jmx metrics [puppet] - 10https://gerrit.wikimedia.org/r/259517 (owner: 10Ottomata) [15:46:57] PROBLEM - cassandra service on restbase1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [15:47:31] (03PS1) 10BBlack: Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends" [puppet] - 10https://gerrit.wikimedia.org/r/259518 (https://phabricator.wikimedia.org/T121564) [15:47:48] (03PS1) 10Jcrespo: Repool db1022, depool db1041 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259519 [15:47:50] (03PS2) 10BBlack: Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends" [puppet] - 10https://gerrit.wikimedia.org/r/259518 
(https://phabricator.wikimedia.org/T121564) [15:48:00] (03CR) 10BBlack: [C: 032 V: 032] Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends" [puppet] - 10https://gerrit.wikimedia.org/r/259518 (https://phabricator.wikimedia.org/T121564) (owner: 10BBlack) [15:49:36] PROBLEM - cassandra CQL 10.64.32.160:9042 on restbase1004 is CRITICAL: Connection refused [15:50:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor inline comments. It's starting to look good" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) (owner: 10MaxSem) [15:51:09] 6operations, 10ops-eqiad, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka1001 & kafka1002 - https://phabricator.wikimedia.org/T121553#1884589 (10Ottomata) [15:51:37] 6operations, 10ops-eqiad, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka1001 & kafka1002 - https://phabricator.wikimedia.org/T121553#1881640 (10Ottomata) @cmjohnson, I checked off all the boxes I was sure was done. If you check off the other ones, we can close this ticket! :) Thank you so much f... [15:52:13] (03CR) 10Jcrespo: [C: 032] Repool db1022, depool db1041 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259519 (owner: 10Jcrespo) [15:52:44] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1884591 (10Ottomata) ok! looking good! We may have a problem for codfw -- we don't have a zookeeper cluster there. For some reason I assumed there was already a... 
[15:53:57] !log cassandra on restbase1004 couldn't finish decommissioning, out of disk space, running nodetool clean [15:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:54:10] (03CR) 10Alexandros Kosiaris: diamond: Add openldap collector (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/258491 (owner: 10Alexandros Kosiaris) [15:54:48] (03PS3) 10Alexandros Kosiaris: diamond: Add openldap collector [puppet] - 10https://gerrit.wikimedia.org/r/258491 [15:54:50] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1022, depool db1041 (duration: 00m 30s) [15:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:54:56] RECOVERY - cassandra service on restbase1004 is OK: OK - cassandra is active [15:55:56] RECOVERY - Disk space on restbase1004 is OK: DISK OK [15:56:06] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures [15:56:15] and thanks to races on etcd-based backend lists and puppet-based ones, there might be a spam of those cp puppetfails [15:56:16] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures [15:56:17] PROBLEM - puppet last run on cp2004 is CRITICAL: CRITICAL: Puppet has 1 failures [15:56:17] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Puppet has 1 failures [15:56:24] they're not important, they'll fix on the next run (which is already happening) [15:57:05] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Puppet has 1 failures [15:57:05] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Puppet has 1 failures [15:57:15] PROBLEM - puppet last run on cp2023 is CRITICAL: CRITICAL: Puppet has 1 failures [15:57:16] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 1 failures [15:57:36] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 1 failures [15:57:37] RECOVERY - cassandra CQL 10.64.32.160:9042 on restbase1004 is OK: TCP OK -
0.001 second response time on port 9042 [15:57:46] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: Puppet has 1 failures [15:57:46] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: Puppet has 1 failures [15:57:47] PROBLEM - puppet last run on cp2009 is CRITICAL: CRITICAL: Puppet has 1 failures [15:57:55] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [15:57:55] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [15:57:55] PROBLEM - puppet last run on cp2007 is CRITICAL: CRITICAL: Puppet has 1 failures [15:57:57] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Puppet has 1 failures [15:58:05] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: Puppet has 1 failures [15:58:06] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:17] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:18] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet has 1 failures [15:58:18] RECOVERY - puppet last run on cp2004 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:58:27] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 1 failures [15:58:35] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures [15:58:36] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Puppet has 1 failures [15:58:37] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: Puppet has 1 failures [15:59:06] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:59:06] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:59:17] RECOVERY - puppet last run on cp2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures 
[15:59:17] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Puppet has 1 failures [15:59:17] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:59:25] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Puppet has 1 failures [15:59:36] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [15:59:47] PROBLEM - puppet last run on cp2003 is CRITICAL: CRITICAL: Puppet has 1 failures [15:59:47] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [15:59:55] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:00:05] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151216T1600). Please do the needful. [16:00:05] yurik: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[16:00:06] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:07] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:00:17] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 1 failures [16:00:27] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [16:00:27] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:36] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:00:36] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:27] Hmm, just the one config change this AM. [16:01:27] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [16:01:30] yurik: you about? 
[16:01:35] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [16:02:06] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:02:06] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:02:06] RECOVERY - puppet last run on cp2010 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:02:06] RECOVERY - puppet last run on cp2009 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [16:02:06] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [16:02:07] RECOVERY - puppet last run on cp2007 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [16:03:19] RECOVERY - puppet last run on cp2003 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:05:01] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:05:01] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:05:09] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:05:10] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:05:10] RECOVERY - puppet last run on mw2028 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:05:20] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:19:34] (03CR) 10BryanDavis: "We need the account to have the same uid on all deploy servers. 
This is required by our chrooted rsync setup and the need to mirror /srv/m" [puppet] - 10https://gerrit.wikimedia.org/r/259441 (https://phabricator.wikimedia.org/T120585) (owner: 10Dzahn) [16:20:21] moritzm: that updated report on the etherpad… it’s just from monitoring port 389 right? [16:20:34] I’m trying to reproduce the traffic from bastion-restricted-01.bastion.eqiad.wmflabs [16:20:37] and can't [16:20:49] (03PS1) 10Jcrespo: Reconfigure mysql at db1041 and all s7 codfw slaves [puppet] - 10https://gerrit.wikimedia.org/r/259523 [16:21:16] (03PS6) 10BryanDavis: scap: change l10nupdate UID from 10002 to 120 [puppet] - 10https://gerrit.wikimedia.org/r/259441 (https://phabricator.wikimedia.org/T120585) (owner: 10Dzahn) [16:22:05] (03CR) 10Jcrespo: [C: 032] Reconfigure mysql at db1041 and all s7 codfw slaves [puppet] - 10https://gerrit.wikimedia.org/r/259523 (owner: 10Jcrespo) [16:23:02] 6operations, 10Deployment-Systems, 5Patch-For-Review: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585#1884698 (10bd808) [16:23:02] (03CR) 10Luke081515: [C: 031] Set password policy for global sysadmin group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259436 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp) [16:24:15] (03CR) 10Luke081515: [C: 031] Set password policy for global steward group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259439 (https://phabricator.wikimedia.org/T104371) (owner: 10CSteipp) [16:25:33] andrewbogott: 389 and 636 [16:25:41] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1884703 (10GWicke) I wonder where these snapshots come from. We don't have any regular snapshotting set up, so these should be triggered manually. It would... 
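bd808's point above — the l10nupdate account must have the same uid on every deploy server, or the chrooted rsync mirroring of /srv/mediawiki ends up with diverging file ownership — is easy to check mechanically. A minimal sketch; the host names and uids below are invented (120 and 10002 are the values from the T120585 change itself):

```python
# Sketch of the constraint behind the l10nupdate UID change: the uid
# must be identical on all deploy servers, or the chrooted rsync mirror
# of /srv/mediawiki produces files with mismatched ownership.

def uids_consistent(uid_by_host):
    """True when every host reports the same numeric uid for the account."""
    return len(set(uid_by_host.values())) <= 1

print(uids_consistent({"tin": 120, "mira": 120}))    # True
print(uids_consistent({"tin": 10002, "mira": 120}))  # False: the T120585 mismatch
```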
[16:26:17] (03CR) 10Catrope: [C: 031] "LGTM, but I've never used the computed dblists feature" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250460 (owner: 10Mattflaschen) [16:26:33] andrewbogott: and I can double-check the dump or send you the excerpt later, I'm out for an hour or so [16:26:42] ok [16:29:08] moritzm: I was thinking something along the lines of “the files are right, maybe I should restart nslcd” but then I couldn’t see the traffic at all [16:29:24] ostriches, yep [16:30:11] Alrighty. Just that one config change? [16:31:10] !log restarting and reconfiguring mysql on db1041 [16:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:31:17] ostriches, for now, yes [16:31:29] (03CR) 10Chad: [C: 032] Disable 'Graph:' ns for meta and mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259448 (owner: 10Yurik) [16:32:15] (03Merged) 10jenkins-bot: Disable 'Graph:' ns for meta and mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259448 (owner: 10Yurik) [16:32:20] ostriches, do you have access to collabwiki [16:32:29] I do not. [16:32:41] ostriches, do you remember how to do a sql query on prod db? [16:32:42] !log restarted `nodetool decommission` on restbase1004 [16:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:59] yurik: I do. What needs doing here? [16:33:05] need to check if any pages exist in that namespace [16:33:14] checking the number... [16:33:25] Ah, one sec. [16:33:29] What's the #? [16:34:04] ostriches, 484 [16:34:20] ostriches, i also need to check that for another site, what's the shell command? [16:34:32] i know we had it documented somewhere [16:34:47] yurik: `sql ` [16:34:51] 1 page on collabwiki [16:34:56] from tin?
[16:35:02] meh, i need to get access to that site [16:35:15] or just get the content from db ) [16:35:39] tin or terbium [16:35:43] 1 page, 1 revision [16:35:50] probably junk [16:35:55] but we should check [16:36:06] i guess i should go via the regular channel of getting access and oll [16:36:09] all [16:37:33] 6operations, 10hardware-requests: Hardware access request for yubico auth servers - https://phabricator.wikimedia.org/T118983#1884758 (10RobH) a:5mark>3RobH [16:38:41] yurik: Well requesting a password reset didn't work, guess I don't have access after all. [16:38:56] ostriches, i think its a private wiki [16:39:26] ostriches, yet, i just discovered something else - i forgot that 484 and 486 ns get declared together [16:39:40] so now there are a few 486 pages that are not accessible [16:39:51] could you revert it until i clean it up? [16:39:55] sorry to bug you [16:40:20] ostriches, or better yet, i will commit another patch [16:40:40] 6operations, 10hardware-requests: Hardware access request for yubico auth servers - https://phabricator.wikimedia.org/T118983#1884760 (10RobH) Thanks! I've allocated system WMF4576 for this request. I'll create the sub-tasks for its setup. [16:41:50] (03CR) 10Hashar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255135 (https://phabricator.wikimedia.org/T118570) (owner: 10EBernhardson) [16:41:54] !log cleared snapshots on cassandra cluster [16:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:42:46] (03PS1) 10Yurik: Re-enabled mediawikiwiki to wmgUseGraphWithNamespace - need to clean up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259524 [16:42:52] ostriches, ^ [16:43:12] metawiki is clean [16:45:06] Ah duh, I had never sync'd the original patch [16:45:09] no wonder you were able to clean [16:45:19] Second patch not needed, I'll just sync the first. 
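The check ostriches ran above — whether any pages exist in namespace 484 — comes down to a single query against MediaWiki's `page` table, normally reached through the `sql` wrapper on tin or terbium. A sketch against an in-memory SQLite stand-in (the real table is MySQL, and the rows here are invented):

```python
import sqlite3

# Stand-in for MediaWiki's `page` table (the real one lives in MySQL and
# is reached through the `sql` wrapper on tin/terbium). Rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT)"
)
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?)",
    [(1, 0, "Main_Page"), (2, 484, "Some_graph"), (3, 486, "Orphaned_data")],
)

def pages_in_namespace(conn, ns):
    """Count pages in a namespace, e.g. before removing its registration."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM page WHERE page_namespace = ?", (ns,)
    ).fetchone()
    return count

print(pages_in_namespace(conn, 484))  # 1
print(pages_in_namespace(conn, 486))  # 1, the kind of orphan yurik still had to clean up
```

This also shows why 486 mattered: namespaces 484 and 486 are registered together, so dropping the registration leaves any 486 pages unreachable until they are deleted.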
[16:45:33] (03CR) 10DCausse: [C: 031] [elasticsearch] Collect cluster health stats about shard movement [puppet] - 10https://gerrit.wikimedia.org/r/259443 (https://phabricator.wikimedia.org/T117284) (owner: 10EBernhardson) [16:46:08] yurik: Er, mediawikiwiki, not metawiki [16:46:12] Too many wikis. [16:46:13] ! [16:46:29] ostriches, ? [16:46:43] ostriches, sec, i basically need to remove the metawiki [16:46:55] mediawikiwiki need to stay until i get rid of the pages in it [16:48:03] (03CR) 10Chad: [C: 032] Re-enabled mediawikiwiki to wmgUseGraphWithNamespace - need to clean up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259524 (owner: 10Yurik) [16:48:05] Ok I get it [16:48:16] thx :) [16:48:54] (03Merged) 10jenkins-bot: Re-enabled mediawikiwiki to wmgUseGraphWithNamespace - need to clean up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259524 (owner: 10Yurik) [16:49:44] 6operations: setup/deploy auth1001(WMF4576) as eqiad auth system - https://phabricator.wikimedia.org/T121655#1884770 (10RobH) 3NEW a:3RobH [16:50:03] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: graph namespace removal/cleanup (duration: 00m 31s) [16:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:50:09] yurik: Ok, that's both patches ^ [16:50:26] ostriches, awesome, thanks! i will clean it up shortly [16:52:13] 6operations, 10ops-codfw: rack new yubico auth system - https://phabricator.wikimedia.org/T120263#1884790 (10RobH) a:5RobH>3MoritzMuehlenhoff I've assigned this task to @MoritzMuehlenhoff, as it is pending his shipment of the YubiHSM module to @Papaul in codfw. Once it is shipped, he can update this task... 
[16:52:22] 6operations, 10ops-codfw: rack new yubico auth system - https://phabricator.wikimedia.org/T120263#1884792 (10RobH) 5Open>3stalled [16:52:45] 6operations, 10ops-codfw: rack/setup/deploy auth2001 as codfw auth system - https://phabricator.wikimedia.org/T120263#1884794 (10RobH) [16:53:31] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1884795 (10GWicke) I went ahead and cleared snapshots across the cluster. [16:57:56] 6operations, 10Traffic, 7HTTPS: acquire SSL certificate for w.wiki - https://phabricator.wikimedia.org/T91612#1884825 (10RobH) [17:00:20] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1884830 (10GWicke) For the record, if disk space is tight from compactions or cleanups, the best way to temporarily clean up space is to run `nodetool stop... [17:03:38] 6operations, 10hardware-requests: Hardware access request for yubico auth servers - https://phabricator.wikimedia.org/T118983#1884856 (10RobH) p:5Triage>3Normal [17:05:17] 6operations, 10hardware-requests: Hardware access request for yubico auth servers - https://phabricator.wikimedia.org/T118983#1814545 (10RobH) [17:05:18] 6operations, 10ops-codfw: rack/setup/deploy auth2001 as codfw auth system - https://phabricator.wikimedia.org/T120263#1884864 (10RobH) [17:06:00] 6operations, 10hardware-requests: Hardware access request for yubico auth servers - https://phabricator.wikimedia.org/T118983#1884865 (10RobH) 5Open>3Resolved Both the codfw and eqiad systems have been allocated and have their own tasks for deployment are T120263 & T121655. Resolving this request as its b... 
[17:08:32] !log setting mysql db1022 as db2028's master [17:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:09:31] 6operations: setup/deploy auth1001(WMF4576) as eqiad auth system - https://phabricator.wikimedia.org/T121655#1884880 (10RobH) [17:13:27] moritzm: I’m heading to lunch and to run some errands. If you’re able to reproduce that traffic just drop me an email about how to reproduce. thx [17:13:34] (03PS1) 10RobH: adding production and updating mgmt entries for auth1001 [dns] - 10https://gerrit.wikimedia.org/r/259530 [17:14:25] (03CR) 10RobH: [C: 032] adding production and updating mgmt entries for auth1001 [dns] - 10https://gerrit.wikimedia.org/r/259530 (owner: 10RobH) [17:21:06] andrewbogott: ok, will ping you later [17:21:18] !log restarting and configuring mysql on db2029 (there will be an increase of errors- from pings, not real traffic) [17:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:23:25] errors on 10.192.16.17 are normal, and not a problem [17:24:28] those are around 1000/minute [17:29:32] (03CR) 10DCausse: [C: 031] Cron job to rebuild completion indices [puppet] - 10https://gerrit.wikimedia.org/r/258068 (https://phabricator.wikimedia.org/T112028) (owner: 10EBernhardson) [17:32:19] (03PS4) 10Faidon Liambotis: diamond: Add openldap collector [puppet] - 10https://gerrit.wikimedia.org/r/258491 (owner: 10Alexandros Kosiaris) [17:32:38] (03CR) 10Faidon Liambotis: [C: 032 V: 032] diamond: Add openldap collector [puppet] - 10https://gerrit.wikimedia.org/r/258491 (owner: 10Alexandros Kosiaris) [17:36:22] the error logging has finished [17:39:04] (03CR) 10Faidon Liambotis: [C: 04-1] "- How does it work with mwdeploy then?" 
[puppet] - 10https://gerrit.wikimedia.org/r/259441 (https://phabricator.wikimedia.org/T120585) (owner: 10Dzahn) [17:39:59] (03CR) 10Faidon Liambotis: [C: 031] "Not sure why this needs my review, but +1 regardless :)" [puppet] - 10https://gerrit.wikimedia.org/r/259057 (owner: 10RobH) [17:40:27] paravoid: i wasnt sure if anything had changed recently in certificate chains, so i added you =] [17:40:36] okay [17:40:38] it's fine :) [17:40:42] but since then have merged others and know its unchanged, but thanks! =] [17:41:07] I've rolled half the new renewals for january already, i plan to do the remainder shortly [17:41:31] (these are all microservices and not core infrastructure anyhow but i want to try to have them all done this week) [17:45:10] (03PS1) 10Jcrespo: Repool db1041 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259532 [17:46:28] !log setting mysql db1041 as db2029's master [17:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:52:24] 7Blocked-on-Operations, 6operations, 7Performance: Update HHVM package to recent release - https://phabricator.wikimedia.org/T119637#1884977 (10faidon) Status update: - As of late last week, we have 3.11 packages prepared *in the upstream Debian pkg-hhvm repository* (to which @joe, myself, and David from Fac... 
[17:56:28] 6operations, 6Labs, 10wikitech.wikimedia.org: Please delete https://wikitech.wikimedia.org/wiki/Schema_changes or give me permissions to do it - https://phabricator.wikimedia.org/T121664#1884995 (10jcrespo) 3NEW [17:57:55] 6operations, 6Labs, 10wikitech.wikimedia.org: Please delete https://wikitech.wikimedia.org/wiki/Schema_changes or give me permissions to do it - https://phabricator.wikimedia.org/T121664#1885006 (10jcrespo) [17:59:54] jynus: save timings got a lot better during "Repool db1022" (https://grafana.wikimedia.org/dashboard/db/save-timing) [18:00:17] at least the timing is right on [18:00:33] 6operations, 10MobileFrontend: Stale copy of Wikipedia:Featured picture candidates/Peacock butterfly - https://phabricator.wikimedia.org/T121594#1885027 (10KLans_WMF) p:5Triage>3Normal [18:00:38] (03CR) 10MaxSem: OSM replication for maps (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) (owner: 10MaxSem) [18:00:56] 6operations, 6Labs, 10wikitech.wikimedia.org: Please delete https://wikitech.wikimedia.org/wiki/Schema_changes or give me permissions to do it - https://phabricator.wikimedia.org/T121664#1885029 (10Krenair) 5Open>3Invalid a:3Krenair not an operational issue [18:01:13] AaronSchulz, bblack and mark have been doing things all day long, they are probably more responsible for it getting better [18:02:57] check on backlog "BBlack: Revert \"cache_text/mobile: send randomized pass traffic directly to t1 backends\"" [18:03:42] I keep thinking I am relevant, but my actions, worst case possible will only affect a 4% of a 10% of a 1/7 of the load [18:04:34] (03PS12) 10MaxSem: OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) [18:04:55] haha nice, spot on compared to last week now [18:05:23] yeah, the start of it doesn't match with any obvious DB things, just the end of it (the regression).
Not seeing anything else in SAL though (maybe unlogged stuff happened then?). [18:05:42] most likely it's the varnish revert we did [18:05:51] mark: was that at 15:50? [18:06:12] but the commit isn't really the issue, it's more like a catalyst that made a different underlying issue more obvious, and we haven't really figured out the details of that other underlying issue, so revert the catalyst [18:06:32] AaronSchulz: yes [18:06:45] bblack: yeah the idea of your original commit made sense to me [18:06:48] indeed, we think we've just started amplifying an existing problem... [18:07:00] PROBLEM - puppet last run on mw2122 is CRITICAL: CRITICAL: puppet fail [18:07:02] the problem with SAL...unlogged stuff ;) [18:07:19] not that I never did that myself [18:07:22] 6operations: setup/deploy auth1001(WMF4576) as eqiad auth system - https://phabricator.wikimedia.org/T121655#1885046 (10RobH) [18:07:58] (03PS1) 10Muehlenhoff: Bump connection limit to 8192 [puppet] - 10https://gerrit.wikimedia.org/r/259534 [18:17:24] (03PS1) 10RobH: setting auth1001 install params [puppet] - 10https://gerrit.wikimedia.org/r/259535 [18:18:47] (03CR) 10RobH: [C: 032] setting auth1001 install params [puppet] - 10https://gerrit.wikimedia.org/r/259535 (owner: 10RobH) [18:20:14] 6operations, 6Labs, 10wikitech.wikimedia.org: Please delete https://wikitech.wikimedia.org/wiki/Schema_changes or give me permissions to do it - https://phabricator.wikimedia.org/T121664#1885079 (10Legoktm) Instead of deleting, please use templates like https://wikitech.wikimedia.org/wiki/Template:Archive [18:20:28] PROBLEM - MariaDB Slave SQL: s6 on db2046 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1677, Errmsg: Column 4 of table ruwiki.geo_tags cannot be converted from type decimal(11,8) to type float [18:21:10] hm? [18:21:30] jynus: ?
[18:22:28] * AaronSchulz looks at https://gerrit.wikimedia.org/r/#/c/258465/ [18:22:48] ah, didn't see that one, I was thinking of the central auth related one [18:23:31] looks like one of db2028's slaves, and 17:08 setting mysql db1022 as db2028's master [18:25:45] yes [18:25:48] I'm on it [18:26:21] that is independent of that, maybe only discovered by that [18:28:51] master: `gt_lat` decimal(11,8) DEFAULT NULL, [18:29:02] slave: `gt_lat` float NOT NULL, [18:29:05] not good [18:31:51] who can a schema change that was done on the master not reach one, and only 1 slave: https://phabricator.wikimedia.org/T89986 ? [18:31:56] *how [18:33:11] RECOVERY - puppet last run on mw2122 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [18:34:29] not only is it only on that server, it is only on that wiki [18:39:08] !log performing schema change on ruwiki.geo_tags on db2046 [18:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:41:47] 6operations, 10MobileFrontend: Stale copy of Wikipedia:Featured picture candidates/Peacock butterfly - https://phabricator.wikimedia.org/T121594#1885143 (10dr0ptp4kt) Unfortunately (or fortunately), this issue is no longer reproducible. The latest edit is now properly showing on mdot. Is anyone see... [18:44:36] (03CR) 10Alexandros Kosiaris: [C: 031] "Sounds fine to me. very detailed commit message, nice" [puppet] - 10https://gerrit.wikimedia.org/r/259534 (owner: 10Muehlenhoff) [18:45:40] (03CR) 10Alexandros Kosiaris: "doubtful it will be used anywhere else. And if the need shows up, let's handle it then."
[puppet] - 10https://gerrit.wikimedia.org/r/259319 (owner: 10Dzahn) [18:54:46] 6operations, 6Labs, 10wikitech.wikimedia.org: Please delete https://wikitech.wikimedia.org/wiki/Schema_changes or give me permissions to do it - https://phabricator.wikimedia.org/T121664#1885191 (10jcrespo) No, I had already archived the old page, I needed to delete it because, for some reason, it didn't all... [18:57:39] (03PS5) 10Dzahn: icinga: add logic to avoid paging for test machines [puppet] - 10https://gerrit.wikimedia.org/r/259319 [18:59:21] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [19:00:05] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151216T1900). Please do the needful. [19:00:12] 6operations, 10DBA, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1885232 (10jcrespo) All the main cross-datacenter links have been switched to TLS (s1-7 and es2-3). As I mentioned before, it will take some time for the changes to be rolled in to all servers. 
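The replication break flagged above (Errno 1677) is schema drift: the master defines `ruwiki.geo_tags.gt_lat` as `decimal(11,8)` while db2046 still had `float`, and row-based replication refuses the narrowing conversion. A hedged illustration in Python, using `struct`'s 4-byte float as a stand-in for MySQL's FLOAT (the coordinate value is made up), of why that conversion would also be lossy:

```python
import struct

# A latitude with 8 decimal places, as the master's decimal(11,8)
# column stores it (the value itself is invented for illustration).
lat = 55.75582600

# Round-trip through a 4-byte IEEE-754 float, the storage format of
# the slave's FLOAT column.
as_float32 = struct.unpack('!f', struct.pack('!f', lat))[0]

# float32 keeps only ~7 significant digits, so the 8 decimal places
# of decimal(11,8) cannot survive the conversion.
error = abs(as_float32 - lat)
```

Eight decimal places of latitude are sub-millimeter precision; `decimal(11,8)` preserves them and a 4-byte float cannot, so fixing the slave's column (as jynus did) is the right direction.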
[19:00:41] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:00:47] !log updating group1 wikis to 1.27.0-wmf.9 [19:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:01:31] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:02:41] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:02:51] (03PS1) 10Thcipriani: group1 wikis to 1.27.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259538 [19:05:43] (03CR) 10Thcipriani: [C: 032] "Train" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259538 (owner: 10Thcipriani) [19:06:08] (03Merged) 10jenkins-bot: group1 wikis to 1.27.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259538 (owner: 10Thcipriani) [19:06:52] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1885317 (10Smalyshev) As I understand it, Pageview API allows to have data about one article or aggregated data about se... [19:07:03] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.9 [19:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:07:41] 6operations, 10MobileFrontend: Stale copy of Wikipedia:Featured picture candidates/Peacock butterfly - https://phabricator.wikimedia.org/T121594#1885329 (10GWicke) A possible reason for this would be a backlog in the job queue. To check for the number of current jobs per wiki, you can run this on tin: `mwscrip... [19:08:10] PROBLEM - puppet last run on mw2173 is CRITICAL: CRITICAL: puppet fail [19:09:27] Train deploy for group1 wikis to 1.27.0-wmf.9 completed. 
[19:12:19] (03CR) 10Dzahn: "amended to use "do_paging" as suggested by Alex" [puppet] - 10https://gerrit.wikimedia.org/r/259319 (owner: 10Dzahn) [19:18:14] bd808: how do I get INFO level messages into logstash for a certain channel? [19:19:14] that should be the default threshold, so you just need the channel to be in $wmgMonologChannels (may not be the exactly right global name but close) [19:19:56] if you have a channel that is logging to logstash and info isn't showing up that probably means I/we excluded it for being too noisy [19:20:11] Krinkle, did you take the time to edit the image? [19:20:53] bd808: hey, can you tell me what exactly the daily sync-l10n run does? [19:21:12] (the answer i am hoping for is "backports messages from master to wmf branches") [19:21:24] bd808: so https://gerrit.wikimedia.org/r/#/c/259454/ will already be enough to get these events there then it looks [19:21:34] 6operations: setup/deploy auth1001(WMF4576) as eqiad auth system - https://phabricator.wikimedia.org/T121655#1885359 (10RobH) [19:22:44] MatmaRex: it "exactly" runs this script -- https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/files/l10nupdate-1 -- and I believe that yes the intent of that process is to merge the latest i18n messages from master into the l10n caches for the deployed branches [19:23:26] AaronSchulz: yeah, but do you really need those in logstash? [19:24:00] that basically makes us record a log event for every lookup right? [19:24:07] bd808: thanks, that's wonderful [19:25:20] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
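bd808's point about per-channel thresholds maps onto any hierarchical logging setup. A minimal Python sketch of the idea (the channel names here are made up, not the actual $wmgMonologChannels contents, and the real behavior lives in MediaWiki's Monolog configuration):

```python
import logging

# Collect emitted records in a list so the effect of levels is visible.
records = []

class ListHandler(logging.Handler):
    def emit(self, record):
        records.append((record.name, record.levelname, record.getMessage()))

root = logging.getLogger()
root.addHandler(ListHandler())
root.setLevel(logging.DEBUG)

# Hypothetical channels: one at the default INFO threshold, one raised
# to WARNING because it was "too noisy".
logging.getLogger('authmanager').setLevel(logging.INFO)
logging.getLogger('noisy-channel').setLevel(logging.WARNING)

logging.getLogger('authmanager').info('login attempt')       # kept
logging.getLogger('noisy-channel').info('cache lookup')      # filtered out
logging.getLogger('noisy-channel').warning('cache failure')  # kept
```

An INFO event on a channel whose threshold was raised never reaches the sink, which is exactly why "info isn't showing up" usually means the channel was excluded for noise.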
[19:26:45] bd808: I'll add some statsd calls...I wonder if that just makes these redundant then [19:27:38] AaronSchulz: if you are just looking for time based trends I think graphite is the right place to track that, so +1 [19:28:42] chasemp, twentyafterfour: I'm getting 503 responses from phabricator.wikimedia.org [19:28:50] +1 [19:29:09] and this time is not the database [19:29:11] Same. [19:29:15] It's just spinning for me. [19:29:18] * ostriches looks [19:29:22] * bd808 takes that as a queue to get lunch [19:29:54] bd808: indeed [19:29:56] im on iridium but i dont want to step on anyones troubleshooting [19:30:04] but indeed, no phab [19:30:22] back [19:30:25] robh: not sure what the problem is, I'm tempted to just bump apache? [19:30:32] yea [19:30:40] thats what i'd do as well, to eliminate potential there [19:30:41] back for me [19:30:44] wfm [19:30:46] oh [19:30:53] apache just restarted a minute or so ago [19:30:53] 300+ loadavg on iridium [19:30:55] looks like there are some repository operations using a lot of cpu [19:30:58] [Wed Dec 16 19:29:09.204478 2015] [mpm_prefork:notice] [pid 25193] AH00171: Graceful restart requested, doing restart [19:30:58] [Wed Dec 16 19:29:13.796521 2015] [mpm_prefork:notice] [pid 25193] AH00163: Apache/2.4.7 (Ubuntu) PHP/5.5.9-1ubuntu4.13 configured -- resuming normal operations [19:30:58] twentyafterfour: so i wouldnt do it now [19:31:01] but that shouldn't break everything [19:31:15] but yep, its back [19:31:18] Something requested an apache restart? [19:31:20] (03PS1) 10Dzahn: icinga: disable paging for test hosts [puppet] - 10https://gerrit.wikimedia.org/r/259540 [19:31:24] wasn't me [19:31:27] there are apache+php processes spinning on a full CPU core [19:31:38] ostriches: yes [19:31:45] Couple of segfaults about 15 minutes ago but otherwise error.log is fairly quiet. 
[19:31:53] hmm [19:31:53] twentyafterfour: tons of pid 583 sec 79 state G client 10.64.0.106 host phabricator.wikimedia.org:80 uri GET /diffusion/MW/browse/REL1_25/includes/parser/Parser.php;1c4 [19:32:04] all of the same uri [19:32:17] hmmm [19:32:27] the spinning php is doing: [19:32:27] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 [19:32:28] rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0 [19:32:28] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 [19:32:28] rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0 [19:32:30] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 [19:32:30] chasemp: crawler run amok again? [19:32:37] (03CR) 10Dzahn: "follow-up would be https://gerrit.wikimedia.org/r/#/c/259540/1" [puppet] - 10https://gerrit.wikimedia.org/r/259319 (owner: 10Dzahn) [19:32:43] twentyafterfour: I thought diffusion was not allowed? [19:32:44] over and over and over [19:32:51] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [19:32:54] yeah https://phabricator.wikimedia.org/robots.txt [19:32:58] chasemp: right, but not everything necessarily respects robots.txt [19:33:01] yeah [19:33:02] That ^ [19:33:11] bblack: kill that spinning proc I guess? [19:33:18] Let's have a look at access.log and see who it is [19:33:26] it's gone now but I believe that was it [19:33:37] I put https://phabricator.wikimedia.org/T112776 yesterday as high, I know it is not your fault [19:33:43] fwiw I am using [19:33:45] python /home/rush/apachetop.py --host 127.0.0.1 [19:34:16] chasemp: that views apache real time request status?
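chasemp's apachetop.py works by polling Apache's mod_status page; with the `?auto` query string mod_status serves a machine-readable variant. A sketch of parsing it (the sample text below is invented, not iridium's real output):

```python
# Example of the machine-readable output mod_status serves at
# http://localhost/server-status?auto (sample values are made up).
sample = """\
Total Accesses: 845
Total kBytes: 11930
Uptime: 1204
BusyWorkers: 37
IdleWorkers: 3
Scoreboard: WWWW__KKGG
"""

def parse_status(text):
    """Parse the 'Key: value' lines of mod_status ?auto output."""
    fields = {}
    for line in text.splitlines():
        if ': ' in line:
            key, _, value = line.partition(': ')
            fields[key] = value
    return fields

status = parse_status(sample)
busy = int(status['BusyWorkers'])
```

Polling this in a loop and diffing the counters is essentially what an apachetop-style tool does; the Scoreboard string is where the "state G" / "state K" worker states above come from.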
[19:34:21] 1450294446.403721 poll([{fd=4, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) [19:34:23] "POST /api/feed.query HTTP/1.0" 200 449 "-" "python-requests/ [19:34:24] 1450294446.403815 write(4, "_\0\0\0\3SELECT * FROM `repository_s"..., 99) = 99 [19:34:27] 1450294446.403882 read(4, "\1\0\0\1\6`\0\0\2\3def\26phabricator_reposi"..., 16384) = 697 [19:34:30] seems like some database select times out [19:34:44] I could kill the spinning proc, but it's just going to recur probably [19:34:54] yeah [19:35:07] there is a lot of POSTing to api/feed.query [19:35:10] Diffusion has a few places where it can potentially spiral out of control [19:35:13] 1450294438.098559 read(4, "bled\":true,\"svn-subpath\":null,\"d"..., 16384) = 14480 [19:35:16] 1450294438.098721 read(4, "ve-over-ssh\":\"readonly\"}\n1417809"..., 16384) = 8352 [19:35:19] 6operations: setup/deploy auth1001(WMF4576) as eqiad auth system - https://phabricator.wikimedia.org/T121655#1885376 (10RobH) [19:35:19] 1450294438.563425 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=14826, si_status=0, si_utime=0, si_stime=0} --- [19:35:22] the feed.query is wikibugs [19:35:22] from something using python requests [19:35:22] 1450294438.563509 rt_sigreturn() = 0 [19:35:23] twentyafterfour: yeah it uses http://localhost/server-status [19:35:25] 1450294446.403721 poll([{fd=4, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) [19:35:28] 1450294446.403815 write(4, "_\0\0\0\3SELECT * FROM `repository_s"..., 99) = 99 [19:35:32] twentyafterfour: gotcha, ok [19:35:33] ^ 8 second pause there on something to do with that JSON fetch from a DB [19:35:40] twentyafterfour: we really gotta get that realip stuff going [19:35:43] but https://phabricator.wikimedia.org/P2428 [19:35:48] 1450294438.089471 write(4, "=\0\0\0\3SELECT `r`.* FROM `reposito"..., 65) = 65 [19:35:53] chasemp: unblocked upstream isn't it? 
[19:35:57] ^ was the select before the big read that ended up pausing [19:36:05] twentyafterfour: yeah should be [19:36:09] 6operations: setup/deploy auth1001(WMF4576) as eqiad auth system - https://phabricator.wikimedia.org/T121655#1885377 (10RobH) a:5RobH>3MoritzMuehlenhoff I've assigned this task to @MoritzMuehlenhoff, as it is pending his shipment of the YubiHSM module to @cmjohnson in codfw. Once it is shipped, he can updat... [19:36:10] now [19:36:17] 6operations, 10hardware-requests: Hardware access request for yubico auth servers - https://phabricator.wikimedia.org/T118983#1885380 (10RobH) [19:36:18] 6operations: setup/deploy auth1001(WMF4576) as eqiad auth system - https://phabricator.wikimedia.org/T121655#1885379 (10RobH) 5Open>3stalled [19:36:29] bblack: Do you have that full query btw? [19:36:33] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [19:36:35] * ostriches would like to know what repo [19:36:35] 1450294576.225000 write(4, "=\0\0\0\3SELECT `r`.* FROM `repository` r ORDER BY `r`.`id` DESC ", 65) = 65 [19:36:39] ^ full query [19:36:59] ty [19:37:01] RECOVERY - puppet last run on mw2173 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [19:37:11] bblack: that looks rather hairy [19:37:24] hmm [19:37:26] that seems like a consequence of the /diffusion traffic [19:37:31] that query returns a ton of data in read() calls, then eventually the read traffic pauses for 8s and then it times out and aborts [19:38:16] during the pause, it spins with no delay on the rt_sigprocmask calls eating up CPU [19:38:25] > UNRECOVERABLE FATAL ERROR <<<\n\nMaximum execution time of 30 seconds exceed ed\n\n/srv/phab/phabricator/src/applications/repository/graphcache/PhabricatorRepositoryGraphCache.php [19:38:32] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:38:44] graphs about repos? 
[19:39:05] phabricator_repository(.repository) is the diffusion db [19:39:14] I killed pid 40454, the one doing all the bad stuff [19:39:28] and a new php process took its place doing the same thing [19:39:37] yeah so running browse on a large file starts doing postbacks to get blame [19:39:43] and that shit spirals out of control [19:39:59] maybe the functionality can be disabled? [19:40:04] * twentyafterfour just loaded https://phabricator.wikimedia.org/diffusion/MW/browse/REL1_25/includes/parser/Parser.php which triggered the same behavior [19:40:09] there's another one just now [19:40:11] EXCEPTION: (DiffusionRefNotFoundException) [19:40:15] twentyafterfour: also https://phabricator.wikimedia.org//diffusion/MW/browse/REL1_25/includes/parser/Parser.php;1c4 [19:40:19] seems like a nonexistent line [19:40:21] the problem is that web git blame on a large file with large history isn't really scalable [19:40:22] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:41:07] I mean disabling parts of it to make it rock solid, then investigate [19:41:11] hmm I'm gonna ask #phabricator [19:41:11] "Maximum execution time of 30 seconds exceed [19:41:13] (03PS1) 10Ottomata: Move role::scap::target to scap::target [puppet] - 10https://gerrit.wikimedia.org/r/259542 [19:41:13] " [19:41:23] can we configure that max exec time much lower, like 5s? might help [19:41:27] jynus: yeah that'd be good, though I don't know how to disable exactly [19:41:40] I'm going to repack that repository on iridium. It got missed when I was doing tight repacks last week of MW and puppet. [19:41:43] we could block all diffusion URLs [19:41:51] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: puppet fail [19:41:58] but we can't really be selective on file size / history [19:42:00] bblack: as a temp fix maybe, but a lot of maniphest requests probably use 5 secs or more (especially file uploads...)
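Lowering PHP's max_execution_time, as discussed here, caps how long one expensive blame can occupy a worker. A rough Python analogy of the same idea, using a subprocess timeout to cut off a slow child; the sleeping child merely stands in for a long-running `git blame` and the 0.5s budget is arbitrary:

```python
import subprocess
import sys

# Stand-in for an expensive operation such as `git blame` on
# Parser.php: a child that would run for 5 seconds if left alone.
slow_child = [sys.executable, '-c', 'import time; time.sleep(5)']

timed_out = False
try:
    # Enforce a hard time budget, analogous to max_execution_time;
    # on expiry, run() kills the child and raises TimeoutExpired.
    subprocess.run(slow_child, timeout=0.5)
except subprocess.TimeoutExpired:
    timed_out = True
```

The trade-off jynus raises applies equally here: a single global budget also cuts off legitimately slow requests, such as large file uploads.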
[19:45:08] seems like 45.32.160.62 [19:45:31] not sure if that was the first heavy diffusion hitter but my thinking is they have moved on to another uri: diffusion/MW/browse/master/maintenance/oracle/tables.sql [19:45:55] (03PS1) 10Dzahn: phabricator: lower max execution time to 10s [puppet] - 10https://gerrit.wikimedia.org/r/259544 [19:45:56] ^ [19:46:16] !log iridium: repacking MW / OPUP repositories as phd user. This needs to be a cron. [19:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:46:36] twentyafterfour: ^ 10s maybe? [19:47:14] mutante: yeah that'd be good for now to keep things sane [19:47:28] fwiw, there is also "Maximum amount of time each script may spend parsing request data." [19:47:33] but that's 60 [19:47:58] (03PS1) 10BBlack: temporarily block diffusion browse [puppet] - 10https://gerrit.wikimedia.org/r/259545 [19:48:16] ^ that may work as a temporary hack for now [19:48:25] (03PS52) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [19:48:28] should i go ahead with the lower max exec? [19:48:53] (03PS53) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [19:49:18] mutante: I would [19:49:39] (03PS2) 10Dzahn: phabricator: lower max execution time to 10s [puppet] - 10https://gerrit.wikimedia.org/r/259544 [19:49:49] any objection to blocking /diffusion/whatever/browse/ completely for now?
[19:49:56] (03CR) 10Dzahn: [C: 032] phabricator: lower max execution time to 10s [puppet] - 10https://gerrit.wikimedia.org/r/259544 (owner: 10Dzahn) [19:50:26] lots of "client denied by server config" now [19:50:35] yeah that was me [19:50:39] 'k [19:51:36] still lots of php/git CPU on iridium, but the major 503 spikes were fairly short [19:51:44] so i merged but it's not applied yet [19:51:49] I donno, I'll leave that patch up in case someone wants to apply it [19:51:51] because puppet is disabled [19:51:57] (03PS54) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [19:51:58] chasemp: lemme know [19:52:08] sure just give me a moment [19:53:41] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [19:53:47] ostriches: the git pack-objects stuff is something you triggered? [19:53:55] Yes, almost done with that. [19:53:57] I !log'd [19:54:04] (03PS55) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [19:55:08] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [19:55:24] (03PS56) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [19:56:13] now a lot of this pid 5459 sec 0 state C client 10.64.32.134 host phabricator.wikimedia.org:80 uri NULL [19:56:26] chasemp: The repacking we did on gerrit/gitblit last week was never done on iridium. Big repos -> slow repos.
[19:56:29] Almost done now. [19:56:46] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [19:56:52] 45.32.160.62 [19:56:56] (03PS57) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [19:57:03] so I'm pretty sure it's that IP causing the majority of the ruckus [19:57:06] at least at the moment [19:57:13] but I see them hammering away at diffusion [19:57:21] my attempt to block by x-client-ip blew up in my face [19:57:36] anyone w/ the apache.conf chops want to take a look at doing that there? [19:58:04] ostriches: twentyafterfour^ [19:58:16] * ostriches looks [19:58:18] we as of a moment ago were still being hit intermittently [19:59:10] X-Client-IP [19:59:15] 45.32.160.62 [19:59:16] still them [19:59:22] (03CR) 10BryanDavis: "Posted some responses to Faidon's latest questions at T120585#1885495 which seems like a nicer forum for discussing the general issue than" [puppet] - 10https://gerrit.wikimedia.org/r/259441 (https://phabricator.wikimedia.org/T120585) (owner: 10Dzahn) [19:59:27] pid 1838 sec 2 state K client 10.64.32.133 host phabricator.wikimedia.org:80 uri GET /diffusion/MW/browse/master/includes/api/i18n/qqq.json%3Bc7 [19:59:28] type stuff [20:00:27] (03PS58) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:01:59] chasemp: I used SetEnvIf Remote_Addr bad_browser in git.wm.o before [20:02:04] chasemp: /etc/hosts.deny ? [20:02:12] But that probably doesn't work anymore since it's behind varnish tbh.
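Because requests reach iridium through Varnish, Remote_Addr is a cache host (the 10.64.x.x clients above) and the abuser only appears in the X-Client-IP header, which is why the SetEnvIf Remote_Addr approach no longer applies. A hypothetical sketch of the check as a plain function (not the actual Apache configuration that was being drafted):

```python
# Client IPs to refuse; 45.32.160.62 is the address identified in the
# log, 198.51.100.7 below is just a documentation-range example.
BLOCKED_IPS = {'45.32.160.62'}

def is_blocked(headers):
    """Decide from request headers whether to refuse with a 403.

    Behind Varnish, Remote_Addr is a cache backend, so the real
    client address must come from the X-Client-IP header it sets.
    """
    client_ip = headers.get('X-Client-IP', '')
    return client_ip in BLOCKED_IPS

crawler = {'X-Client-IP': '45.32.160.62', 'User-Agent': 'python-requests'}
reader = {'X-Client-IP': '198.51.100.7', 'User-Agent': 'Mozilla/5.0'}
```

The same logic in Apache terms would key a SetEnvIf on X-Client-IP rather than Remote_Addr; trusting that header is only safe when all traffic is guaranteed to arrive via the caches that set it.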
[20:02:22] right [20:02:28] And then deny from env=bad_browser [20:03:09] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, and 2 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1885526 (10Milimetric) >>! In T116312#1883286, @yuvipanda wrote: > This should probably be isolated in its own vl... [20:03:14] ok I'm going to reset and let mutante's patch land here [20:03:16] and then see [20:03:26] it's the blame that causes the browse page to be expensive, it does a post-back to fetch blame info. for Parser.php, git blame took 7 seconds to execute, so that's pretty expensive [20:03:57] yes this is either a crawler or a bot [20:04:38] if it's doing the post to get blame then it's a smart-ish bot, not just a dumb http-get indexer... [20:04:48] chasemp, twentyafterfour: https://phabricator.wikimedia.org/P2429 [20:05:00] e.g. it must be executing client side javascript or just intentionally fetching blame data [20:05:28] ostriches: damn that's a big savings... [20:07:02] still took 6.3 seconds to run git blame on Parser.php [20:07:33] Yeah, blame's always going to be slow on ancient and big files. [20:07:37] Only so much you can optimize there. [20:08:25] The file's 11 years old :p [20:09:32] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:11:37] (03PS59) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:13:58] so one problem is phabricator requests blame automatically (via async http post) [20:14:35] twentyafterfour: can you load https://phab-01.wmflabs.org/ now?
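On the earlier robots.txt point: phabricator.wikimedia.org disallows crawling these URLs, but only cooperative clients ever consult the file, so a bot posting for blame data sails right past it. A sketch with Python's stdlib robotparser, assuming a `Disallow: /diffusion/` rule in the spirit of (not verbatim from) the site's robots.txt:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Hypothetical rules modeled on https://phabricator.wikimedia.org/robots.txt
rp.parse([
    'User-agent: *',
    'Disallow: /diffusion/',
])

blocked = rp.can_fetch(
    '*',
    'https://phabricator.wikimedia.org/diffusion/MW/browse/REL1_25/includes/parser/Parser.php',
)
allowed = rp.can_fetch('*', 'https://phabricator.wikimedia.org/T121594')

# A well-behaved crawler calls can_fetch() before each request;
# the bot hammering /diffusion/ above simply never asks.
```

robots.txt is purely advisory, which is why the actual mitigations here were server-side: execution-time caps, URL blocks, and per-IP denial.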
[20:16:27] (03PS60) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:19:08] chasemp: yes [20:23:03] (03CR) 10Andrew Bogott: [C: 031] "Yep, convincing explanation :)" [puppet] - 10https://gerrit.wikimedia.org/r/259534 (owner: 10Muehlenhoff) [20:24:20] (03CR) 10Andrew Bogott: [C: 031] icinga: add logic to avoid paging for test machines [puppet] - 10https://gerrit.wikimedia.org/r/259319 (owner: 10Dzahn) [20:25:25] (03CR) 10Andrew Bogott: [C: 031] icinga: disable paging for test hosts [puppet] - 10https://gerrit.wikimedia.org/r/259540 (owner: 10Dzahn) [20:31:35] (03CR) 10Alexandros Kosiaris: [C: 031] icinga: add logic to avoid paging for test machines [puppet] - 10https://gerrit.wikimedia.org/r/259319 (owner: 10Dzahn) [20:33:14] chasemp: so on phab-01 the problem doesn't manifest, it seems? [20:33:37] twentyafterfour: I don't follow [20:33:40] I get a fast response from https://phab-01.wmflabs.org/diffusion/MW/browse/REL1_25/includes/parser/Parser.php and even the blame loads in like 3 seconds [20:33:48] ah [20:34:44] well, not quite 3 seconds, I see where I went wrong [20:38:46] (03PS6) 10Dzahn: icinga: add logic to avoid paging for test machines [puppet] - 10https://gerrit.wikimedia.org/r/259319 [20:39:36] (03CR) 10Dzahn: [C: 032] "to confirm i'm copying puppet_services.cfg to compare before and after" [puppet] - 10https://gerrit.wikimedia.org/r/259319 (owner: 10Dzahn) [20:42:51] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [20:48:54] (03PS61) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:49:01] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints 
are healthy [20:51:15] (03CR) 10Alexandros Kosiaris: OSM replication for maps (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) (owner: 10MaxSem) [20:51:25] (03PS1) 10Andrew Bogott: wikidatabuilder: Use require_package to avoid duplicate package conflicts. [puppet] - 10https://gerrit.wikimedia.org/r/259552 [20:51:52] (03PS2) 10Dzahn: icinga: disable paging for test hosts [puppet] - 10https://gerrit.wikimedia.org/r/259540 [20:53:06] (03CR) 10Alexandros Kosiaris: OSM replication for maps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) (owner: 10MaxSem) [20:53:11] (03PS62) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:53:37] akosiaris, in my testing, PGPASSWORD worked:) [20:53:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] "2 unaddressed comments in PS11 that still apply" [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) (owner: 10MaxSem) [20:54:30] (03CR) 10Andrew Bogott: [C: 032] wikidatabuilder: Use require_package to avoid duplicate package conflicts. [puppet] - 10https://gerrit.wikimedia.org/r/259552 (owner: 10Andrew Bogott) [20:54:51] MaxSem: that's disconcerting... it deviates from the man page [20:55:14] not surprised, it's not exactly a well maintained tool [20:55:30] the volkerschatz one? 
it's not an official manual by any means [20:55:38] MaxSem: mind testing if PGPASS also works so at least we are consistent with the manpage [20:55:45] no the jessie one [20:55:50] the one shipped with the software [20:56:49] (03CR) 10MaxSem: OSM replication for maps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) (owner: 10MaxSem) [20:59:02] akosiaris, I'm locked outta puppet-test02.maps-team.eqiad.wmflabs due to LDAP migration - is there a quick way to fix it, or I should just create another instance? [21:00:05] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151216T2100). Please do the needful. [21:00:24] Krenair: where do the visual editor people hang out? And/or is towtruck.visualeditor.eqiad.wmflabs. still good for anything? [21:00:25] win 4 [21:01:23] andrewbogott, #mediawiki-visualeditor [21:01:28] I don't know who is responsible for towtruck [21:01:29] (03PS63) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [21:01:32] (03PS13) 10MaxSem: OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) [21:01:44] I use a different instance in that project to do everything, no idea what the others do [21:02:15] Krenair: thank you [21:02:35] I spent a few brief, disorienting minutes in wikimedia-ve [21:02:55] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [21:04:17] (03PS64) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 
(https://phabricator.wikimedia.org/T118780) [21:05:48] Krenair: andrewbogott towtruck is cscott's project [21:06:15] YuviPanda: I’ve hit two boxes in a row that use singlenode, both broken with "Resource type mw-extension doesn't exist" [21:06:20] Is that something you already know the fix for? [21:07:18] I don't actually know where that comes from, which is baffling. it's not in current puppet, and it's not in LDAP [21:07:26] yeah [21:07:57] I personally think we should let mediawiki_singlenode instances decay and die. it's been years [21:08:01] and most are probably unused [21:08:29] towtruck doesn't let me in as root either [21:09:57] so IMO a message to cscott saying 'hey this instance you have not maintained in a while is dead' and then shutting it down [21:10:04] require => [ Exec['mediawiki_setup','mediawiki_update'], File["${install_path}/privacy-policy.xml", "${install_path}/LocalSettings.php"], Mw-extension[ 'Nuke', 'SpamBlacklist', 'ConfirmEdit' ] ] [21:10:10] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, and 2 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1885844 (10ori) >>! In T116312#1885526, @Milimetric wrote: >>>! In T116312#1883286, @yuvipanda wrote: >> This sho... [21:10:16] is there a better way to include nike, spamblacklist, confirmedit? [21:10:22] oooh [21:10:27] *nuke [21:11:38] (03PS1) 10Yuvipanda: Bandage the undead [puppet] - 10https://gerrit.wikimedia.org/r/259556 [21:11:39] andrewbogott: ^ [21:11:54] include_once( "Nikerabbit" ); [21:12:00] 6operations, 10ops-eqiad: update physical label for auth1001(WMF4576) - https://phabricator.wikimedia.org/T121703#1885857 (10RobH) 3NEW a:3Cmjohnson [21:12:38] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, and 2 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1885866 (10ori) >>!
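The "Resource type mw-extension doesn't exist" failure and the eventual rename to Mwextension come down to Puppet's identifier rules: resource type names may contain only lowercase letters, digits and underscores, so a defined type can never be referenced as Mw-extension[...]. A rough shell illustration of that naming rule (the rule is Puppet's; the script is just a demonstration):

```shell
# Check two candidate puppet type names against the allowed character set
# (lowercase letters, digits, underscore). The hyphenated old name fails,
# the renamed one passes.
for name in mw-extension mwextension; do
  case "$name" in
    *[!a-z0-9_]*) echo "$name: invalid puppet type name" ;;
    *)            echo "$name: ok" ;;
  esac
done
```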
In T116312#1881789, @Joe wrote: > I think piwik will need to have its own database hosted on... [21:13:12] https://www.irccloud.com/pastebin/Dorn6HAh/ [21:14:10] !log starting mobileapps deploy [21:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:14:30] YuviPanda: ^ I don’t see where Mw_extension is defined [21:14:34] andrewbogott: ugh [21:15:16] I https://gerrit.wikimedia.org/r/#/c/259169/ [21:16:09] (03PS2) 10Yuvipanda: Bandage the undead [puppet] - 10https://gerrit.wikimedia.org/r/259556 [21:16:09] updated [21:16:15] I really think they all should be just shut down [21:16:22] +1 [21:16:43] people can't expect to externalize their costs to us forever [21:16:56] andrewbogott: unless you strenuously object, I'm going to shut down towtruck now and tell cscott [21:16:56] (03PS3) 10Andrew Bogott: Use Mwextension instead of the old name Mw-extension [puppet] - 10https://gerrit.wikimedia.org/r/259556 (owner: 10Yuvipanda) [21:17:05] YuviPanda: I’d really rather ask him first [21:17:08] well [21:17:10] I did [21:17:12] the last time I had to do this [21:17:19] and what did he say? [21:17:21] the response was basically 'we might need it in the future' [21:17:26] well, then you gotta maintain it...
[21:17:35] move it to labs_Vagrant or mediawiki_vagrant and not use it for years [21:17:40] If you warned him already then have at [21:17:45] ok [21:17:52] we need to do an audit and kill a lot of these [21:18:06] the great Editor Engagement Team dispersal left a lot of instances in similar states too [21:18:13] with some emotional attachment from people but no time [21:18:19] starting parsoid deploy [21:18:26] (03PS4) 10Andrew Bogott: Use Mwextension instead of the old name Mw-extension [puppet] - 10https://gerrit.wikimedia.org/r/259556 (owner: 10Yuvipanda) [21:19:18] !log starting parsoid deploy [21:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:20:08] (03PS65) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [21:20:30] (03CR) 10Andrew Bogott: [C: 032] Use Mwextension instead of the old name Mw-extension [puppet] - 10https://gerrit.wikimedia.org/r/259556 (owner: 10Yuvipanda) [21:22:01] MaxSem: it seems that somehow you broke the automatic puppet rebase on that host. I've manually rebased it (I did keep all your changes, including the uncommitted ones) but it's working now. [21:22:24] awesome, cheers akosiaris [21:23:13] !log restarted parsoid on wtp1005 as a canary [21:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:24:32] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: puppet fail [21:25:38] looking good. restarting parsoid on all nodes.
[21:27:36] (03CR) 10Smalyshev: [C: 031] [elasticsearch] Collect cluster health stats about shard movement [puppet] - 10https://gerrit.wikimedia.org/r/259443 (https://phabricator.wikimedia.org/T117284) (owner: 10EBernhardson) [21:29:26] 6operations, 10EventBus, 10MediaWiki-Cache, 6Performance-Team, and 2 others: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1885922 (10RobH) 5Open>3Resolved Both tasks for the deployment of these systems are at the service... [21:29:31] 6operations, 6Analytics-Kanban, 6Discovery, 10EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1885924 (10RobH) [21:29:45] !log mobileapps deployed sha1 9f91ad5 [21:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:30:15] 6operations, 10EventBus, 10MediaWiki-Cache, 6Performance-Team, and 2 others: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1885926 (10Ottomata) YEehaw, thank you! 
[21:31:01] (03CR) 10Andrew Bogott: [C: 031] "Looking forward to purging all this code :)" [puppet] - 10https://gerrit.wikimedia.org/r/259226 (owner: 10Muehlenhoff) [21:31:15] (03PS1) 10Rush: phabricator: garbage collect user logs at 30 days [puppet] - 10https://gerrit.wikimedia.org/r/259560 (https://phabricator.wikimedia.org/T114014) [21:31:27] !log finished deploying parsoid 64029e12 [21:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:32:05] (03PS2) 10Rush: phabricator: garbage collect user logs at 30 days [puppet] - 10https://gerrit.wikimedia.org/r/259560 (https://phabricator.wikimedia.org/T114014) [21:32:40] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, and 2 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1885930 (10Nuria) Sounds like a VM is the way to go. [21:34:08] 6operations, 10ops-eqiad, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka1001 & kafka1002 - https://phabricator.wikimedia.org/T121553#1885940 (10Cmjohnson) [21:34:20] 6operations, 10EventBus, 10MediaWiki-Cache, 6Performance-Team, and 2 others: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1885943 (10Cmjohnson) [21:34:23] 6operations, 10ops-eqiad, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka1001 & kafka1002 - https://phabricator.wikimedia.org/T121553#1885941 (10Cmjohnson) 5Open>3Resolved Completed [21:35:24] (03CR) 1020after4: [C: 031] phabricator: garbage collect user logs at 30 days [puppet] - 10https://gerrit.wikimedia.org/r/259560 (https://phabricator.wikimedia.org/T114014) (owner: 10Rush) [21:38:34] (03CR) 10Rush: [C: 032] phabricator: garbage collect user logs at 30 days [puppet] - 10https://gerrit.wikimedia.org/r/259560 (https://phabricator.wikimedia.org/T114014) (owner: 10Rush) [21:44:59] bd808: is the ‘stashbot’ project still active? 
[21:45:13] andrewbogott: yes, very much so [21:45:53] (03PS1) 10Yuvipanda: ores: Stop using aof for redis persistance [puppet] - 10https://gerrit.wikimedia.org/r/259593 [21:46:25] bd808: ok. Puppet seems dead on several instances, I’ll see what I can find. [21:46:31] I'm working on moving it into Tool Labs. That's probably not going to be done for another week or two though [21:46:43] ok if I rebase puppet on stashbot-deploy? [21:46:45] ugh. it has a self-hosted puppetmaster [21:46:48] yeah [21:47:00] I think there is one local cherry-pick [21:47:39] andrewbogott: can you make a note of all the responses you get from people when you ask somewhere? :) [21:47:55] unless it's just 'kill it' in which case we can just kill it [21:48:05] um… maybe? [21:48:09] andrewbogott: I think this -- https://gerrit.wikimedia.org/r/#/c/236714/ -- was the only weird puppet patch there [21:48:14] It will be very haphazard [21:48:43] andrewbogott: just an etherpad or something. we have to do an audit soon, so it'll be helpful? [21:48:50] yeah [21:48:54] bd808: 404? [21:49:26] YuviPanda: hmm.. it's a draft but that shouldn't make it 404 should it [21:49:33] oh, or 'permission not denied' [21:49:42] it said '404 or permision denied, can not tell you lol' [21:49:48] heh [21:49:53] sekure! [21:50:14] segure [21:50:25] it's a role that adds an apache vhost [21:50:27] YuviPanda: https://etherpad.wikimedia.org/p/labs-cleanup-2015 [21:50:43] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [21:51:01] thanks andrewbogott [21:51:06] hm on stashbot-logstash puppet agent -tv just hangs forever [21:51:23] ok, I shall do attempt #6 of getting out of the blanket into the cold world.
brb [21:51:29] never seen that one before [21:53:56] 6operations, 10MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#1885982 (10Mattflaschen) [21:54:40] andrewbogott: is it causing problems with bad ldap queries or something? I can try to fix it if there is something urgent about it [21:54:58] bd808: well right now you can’t log into anything but the puppetmaster, I predict [21:55:02] that is what I am trying to fix [21:55:41] you are correct [21:55:55] which puppet would be fixing if only it didn’t hang [21:57:03] (03CR) 10Dzahn: [C: 032] icinga: disable paging for test hosts [puppet] - 10https://gerrit.wikimedia.org/r/259540 (owner: 10Dzahn) [21:57:18] (03PS3) 10Dzahn: icinga: disable paging for test hosts [puppet] - 10https://gerrit.wikimedia.org/r/259540 [21:57:30] (03PS1) 10Cmjohnson: Removing dhcp entries for ores1001 and ores1002. Swapping for better suited h/w. [puppet] - 10https://gerrit.wikimedia.org/r/259595 [21:58:03] (03PS1) 10Ottomata: Also start/stop/restart keyholder-proxy [puppet] - 10https://gerrit.wikimedia.org/r/259596 [21:58:19] 6operations, 10hardware-requests: eqiad: (2) servers request for ORES - https://phabricator.wikimedia.org/T119598#1885991 (10RobH) Ok, the cleanup of spares (via another task, they were migrated into a sheet for tracking and re-audited) has resulted in my finding a few potential systems for this. @akosiaris a... [21:58:27] 6operations, 10hardware-requests: eqiad: (2) servers request for ORES - https://phabricator.wikimedia.org/T119598#1885997 (10RobH) a:5mark>3akosiaris [22:00:15] (03PS1) 10Cmjohnson: Removing dns entries for ores1001 and ores1002 [dns] - 10https://gerrit.wikimedia.org/r/259597 [22:00:59] bd808: do you mind looking at ferm and/or security groups to see why 8140 is blocked on stashbot-deploy? [22:01:08] (That is, presumably, why its clients are hanging) [22:01:26] (03CR) 10Cmjohnson: [C: 032] Removing dhcp entries for ores1001 and ores1002. 
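A quick way to confirm the "8140 is blocked" theory behind the hanging agents: a firewalled (DROP) port makes connection attempts hang until timeout, while a merely closed port refuses immediately. A sketch of that probe using bash's /dev/tcp (host and port here are illustrative; on the real instance you would point it at stashbot-deploy's 8140):

```shell
# Probe TCP port 8140 with a 3-second budget. Immediate failure means
# closed (or open, if something answers); running out the timeout is the
# signature of a DROP-ing firewall such as a ferm default policy.
if timeout 3 bash -c 'exec 3<>/dev/tcp/127.0.0.1/8140' 2>/dev/null; then
  echo "port 8140: open"
else
  echo "port 8140: closed or filtered"
fi
```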
Swapping for better suited h/w. [puppet] - 10https://gerrit.wikimedia.org/r/259595 (owner: 10Cmjohnson) [22:02:17] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for ores1001 and ores1002 [dns] - 10https://gerrit.wikimedia.org/r/259597 (owner: 10Cmjohnson) [22:02:56] andrewbogott: sure. I'm going to bet that it includes a role where base::firewall has been added directly to the role rather than to hosts that use the role in site.pp [22:03:09] probably something related to deployment/trebuchet [22:04:18] yeah it has base::firewall applied... now to figure out why [22:09:00] andrewbogott: grrr... role::deployment::server unconditionally includes base::firewall [22:09:30] bd808: moritzm added the ferm rules for deployment/scap iirc [22:09:55] yeah which breaks anything in Labs that uses the role combined with other things [22:10:06] because ferm is fun like that [22:10:38] beta had ferm rules for a while though so not much impact there [22:11:02] but on a labs project using some subset of puppet classes, that could indeed sneakily enable ferm which is no fun [22:11:49] (03CR) 10BryanDavis: "Breaks stashbot-deploy and any other Labs hosts where the Trebuchet, Salt and Puppet roles have been combined into a single instance." [puppet] - 10https://gerrit.wikimedia.org/r/250091 (owner: 10Dzahn) [22:12:16] (03PS1) 10BryanDavis: Revert "tin,mira: move base::firewall to deployment role" [puppet] - 10https://gerrit.wikimedia.org/r/259599 [22:12:24] sigh.. is the root problem that we never have the same roles as in prod? [22:12:36] (03CR) 10jenkins-bot: [V: 04-1] Revert "tin,mira: move base::firewall to deployment role" [puppet] - 10https://gerrit.wikimedia.org/r/259599 (owner: 10BryanDavis) [22:12:50] base::firewall really doesn't belong in roles if we are going to get reuse [22:12:53] the other classes that are combined with it need holes?
[22:12:58] yeah [22:13:03] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: puppet fail [22:13:12] which could be literally anything [22:13:30] in this case it's at least the self-hosted puppetmaster role [22:13:39] we need a hiera variable: ferm: false :D [22:13:51] why.. why is the fix to just not have any firewalling [22:14:28] mutante: sure, if we add ferm rules to every role and module we have then we can turn it on everywhere [22:14:41] well, that's what we do [22:14:57] for prod [22:15:18] but on labs you would most probably have a class that does not have any ferm rule to whitelist a port; So you end up blocked :-/ [22:15:39] because everything is different [22:15:45] i dont get it [22:15:52] this is at least the second time and I think the third actually that "works in prod" has messed up Labs things I run with ferm being moved into a role rather than being left in site.pp [22:15:53] i do understand that it should be on the node though [22:16:15] (03PS1) 10Cmjohnson: Removing ores mgmt hostnames [dns] - 10https://gerrit.wikimedia.org/r/259600 [22:16:49] (03CR) 10Andrew Bogott: [C: 031] "ok with me if it's ok with jenkins" [puppet] - 10https://gerrit.wikimedia.org/r/259599 (owner: 10BryanDavis) [22:18:02] (03CR) 10Cmjohnson: [C: 032] Removing ores mgmt hostnames [dns] - 10https://gerrit.wikimedia.org/r/259600 (owner: 10Cmjohnson) [22:18:09] (03PS2) 10BryanDavis: Revert "tin,mira: move base::firewall to deployment role" [puppet] - 10https://gerrit.wikimedia.org/r/259599 [22:19:55] (03CR) 10Dzahn: [C: 032] Revert "tin,mira: move base::firewall to deployment role" [puppet] - 10https://gerrit.wikimedia.org/r/259599 (owner: 10BryanDavis) [22:20:43] mutante: this particular "everything is different" is not a beta cluster vs prod problem. This is a project in Labs, completely unrelated to beta or prod, that is trying to use ops/puppet resources to manage the project configuration.
[22:21:08] If it was a prod v beta cluster difference I would be totally in favor of fixing beta cluster. [22:21:37] But having 3 different VMs in my project to allow Trebuchet to work is really lame and a waste of resources [22:22:22] So when I build these little projects I smush the puppet, salt and deploy server roles into a single m1.small vm [22:23:05] (03PS1) 10Ori.livneh: [WIP] Add piwik module and role [puppet] - 10https://gerrit.wikimedia.org/r/259601 (https://phabricator.wikimedia.org/T103577) [22:24:17] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Add piwik module and role [puppet] - 10https://gerrit.wikimedia.org/r/259601 (https://phabricator.wikimedia.org/T103577) (owner: 10Ori.livneh) [22:24:26] mutante: also, thank you for the quick +2 :) [22:25:30] bd808: np, the intention was just to have no difference between it for once [22:25:44] and i'd think we want to fix the other roles [22:26:16] they are clearly missing some ferm rules then [22:26:41] i also feel like we have done this before [22:26:50] but maybe it was the IPv6 interface [22:27:26] agreed. 
The next time I work on these boxes I'll see if I can track down the places that are missing ferm rules [22:28:26] 6operations, 10EventBus, 10MediaWiki-Cache, 6Performance-Team, and 2 others: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1886139 (10aaron) [22:28:45] 6operations, 10ops-eqiad: Rack 8 new misc servers - https://phabricator.wikimedia.org/T121578#1886143 (10Cmjohnson) [22:29:03] 6operations, 10ops-eqiad, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka1001 & kafka1002 - https://phabricator.wikimedia.org/T121553#1886146 (10Cmjohnson) [22:29:04] bd808: thank you and sorry for the breakage [22:29:05] 6operations, 10hardware-requests: eqiad: (2) servers request for ORES - https://phabricator.wikimedia.org/T119598#1886147 (10Cmjohnson) [22:29:07] 6operations, 10ops-eqiad: Rack 8 new misc servers - https://phabricator.wikimedia.org/T121578#1886144 (10Cmjohnson) 5Open>3Resolved racked and ready [22:32:12] bd808: just the *third* time? :) [22:32:44] YuviPanda: I was trying to be semi-fact based in case I was hit with {{cn}} [22:32:54] heh [22:33:23] did anyone report the hack on en.wp yet? [22:34:13] Alchimista: hack? [22:34:43] (03CR) 10Mattflaschen: [C: 04-1] "We're no longer trying to remove the cache layer. See https://phabricator.wikimedia.org/T94029#1842533 ." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249402 (https://phabricator.wikimedia.org/T94029) (owner: 10Matthias Mullie) [22:34:43] oops, sorry :s [22:35:24] legoktm opening some pages there's a message: FACT: Wikipedia administrators are repugnant sexual deviants willing to travel across continents in order to engage in sexual liaisons with each other. Body size, cleanliness, gender, etc are irrelevant to them. Don't believe me? See this pastebin for more information. [22:35:36] blerg [22:35:45] another template vandalism?
[22:35:53] Alchimista: do you have an example page [22:36:08] can i pastebin the html code? [22:36:08] (03CR) 10Dzahn: "if there is no guarantee that this UID is not used by something else, doesn't that apply to all UIDs in all puppet manifests?" [puppet] - 10https://gerrit.wikimedia.org/r/259441 (https://phabricator.wikimedia.org/T120585) (owner: 10Dzahn) [22:36:13] seems not in all pages [22:36:13] (03Abandoned) 10Dzahn: scap: change l10nupdate UID from 10002 to 120 [puppet] - 10https://gerrit.wikimedia.org/r/259441 (https://phabricator.wikimedia.org/T120585) (owner: 10Dzahn) [22:36:43] https://en.wikipedia.org/wiki/J._J._Abrams [22:37:01] Alchimista: thanks [22:37:26] (03PS1) 10Yuvipanda: labs: Remove nfs_mounts params [puppet] - 10https://gerrit.wikimedia.org/r/259602 [22:38:06] Alchimista: reverted [22:38:29] bd808: I think you’ll probably have to manually tear down that firewall on stashbot-deploy; puppet doesn’t know to undo it [22:39:01] andrewbogott: yeah, thanks for reminding me [22:39:07] legoktm, which template had the code? [22:39:23] ori: https://gerrit.wikimedia.org/r/#/c/259593/1 (ores redis) [22:39:26] Alchimista: https://en.wikipedia.org/w/index.php?title=Template:NYT_topic&diff=next&oldid=695555481 [22:39:32] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:39:39] 6operations, 10RESTBase, 7Graphite, 7service-runner: restbase should send metrics in batches - https://phabricator.wikimedia.org/T121231#1886186 (10Pchelolo) Created a PR in for a statd client we're using https://github.com/sivy/node-statsd/pull/61 to support batching messages. If they don't answer in a re... [22:39:58] and there's someone having fun with this? o.O [22:40:41] sadly, yes [22:42:04] I stuck the image on the bad image list too [22:43:12] bd808: ferm --flush Clears the firewall rules and sets the policy of all chains to ACCEPT.
(as opposed to iptables -flush which could lock you out) [22:43:25] if that was about ferm [22:43:39] mutante: I just did `service ferm stop` [22:43:51] Seems like the right solution would be to make the puppetmaster::self role explicitly open the ports it needs ? [22:43:51] which also seems to kill the iptables rules cleanly [22:43:59] andrewbogott: yes [22:44:02] am I late to the party? Was that solution already considered? [22:44:07] bd808: ok, cool [22:44:45] andrewbogott: but my larger point with base::firewall is that you never know what it will break next [22:45:08] It will, by default, break everything! [22:45:13] I think it is very useful to be applied at the site.pp level but not fun in a role [22:45:38] bd808: In a perfect world it would be included in base [22:45:49] then we’d never be surprised because every class would have explicit rules [22:45:50] ideally all roles open the holes they need, so any combination of them will work. but if they don't we should keep it on nodes [22:45:55] * bd808 looks for a time machine [22:46:21] in this case i honestly just assumed that there is an instance that includes the same roles as mira and tin a [22:46:42] I think the long-term plan is to add it to base, and in the meantime noticing things it breaks is… progress. [22:48:30] yea, "in the role" seems like a step between node and base.. maybe [22:48:56] !log puppet disabled on netmon1001 for librenms certificate update [22:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:49:24] (03PS2) 10RobH: new librenms.wikimedia.org certificate (renewal replacement) [puppet] - 10https://gerrit.wikimedia.org/r/259057 [22:50:43] andrewbogott: what holes does a puppetmaster need? I can't find ferm rules for a puppetmaster anywhere (maybe bad grepping) [22:51:24] 8140 [22:51:28] and… http, maybe?
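The distinction mutante and bd808 draw above is worth spelling out: `ferm --flush` and stopping the ferm service both reset chain policies to ACCEPT, whereas a raw `iptables --flush` removes rules but leaves any non-ACCEPT default policy in place, which is the lock-out risk. A dry-run sketch, echoed rather than executed since the real commands need root on the affected host (the /etc/ferm/ferm.conf path is the assumed Debian default):

```shell
# Echoed rather than executed: these require root on the target instance.
for cmd in \
    "service ferm stop" \
    "ferm --flush /etc/ferm/ferm.conf"; do
  echo "safe:   $cmd   # resets chain policies to ACCEPT"
done
echo "risky:  iptables --flush   # chain policies stay as-is"
```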
[22:51:35] (03CR) 10RobH: [C: 032] new librenms.wikimedia.org certificate (renewal replacement) [puppet] - 10https://gerrit.wikimedia.org/r/259057 (owner: 10RobH) [22:51:39] no, wait, that’s on a weird port too, hang on [22:51:51] anyone deploying in the next hour? I will claim it [22:51:52] role::puppetmaster::backend has some [22:51:58] 8141 and 22 [22:53:30] bd808: andrewbogott welcome to the clusterfuck that is the puppet module [22:53:41] different from but almost an exact copy of base::puppet and puppetmaster [22:53:49] unfortunately we dont have them on palladium yet either [22:54:00] here’s a start… https://docs.puppetlabs.com/pe/latest/install_system_requirements.html#firewall-configuration [22:54:09] mutante: doesn’t palladium use apache/passenger? [22:54:15] andrewbogott: yes, it does [22:54:23] so, totally different from an in-labs puppetmaster I think [22:54:28] https://phabricator.wikimedia.org/T120159 has more details [22:54:36] andrewbogott: it's a totally different module [22:54:47] YuviPanda: yes, I mean — it uses different ports [22:54:54] * YuviPanda nods [22:54:59] due to running a puppetmaster rather than relying on http hooks [22:57:09] 8140 and 8141 tcp for prod [22:58:13] yeah, that sounds right to me [22:58:35] they might appear in the role but are not actually applied yet https://phabricator.wikimedia.org/T113344 [22:58:35] !log librenms returned to normal service (puppet renabled on netmon1001) [22:58:37] (03PS1) 10BryanDavis: Add ferm to puppet::self::master [puppet] - 10https://gerrit.wikimedia.org/r/259608 [22:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:58:47] (03PS2) 10Yuvipanda: labs: Remove nfs_mounts params [puppet] - 10https://gerrit.wikimedia.org/r/259602 [22:58:48] the "palladium" task is still open [22:58:49] (03PS2) 10Yuvipanda: ores: Stop using aof for redis persistance [puppet] - 10https://gerrit.wikimedia.org/r/259593 (https://phabricator.wikimedia.org/T121658) [22:58:50] 
!log disabled puppet on fermium for the lists.w.o cert update [22:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:59:24] (03PS66) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [22:59:33] (03PS4) 10Dzahn: icinga: disable paging for test hosts [puppet] - 10https://gerrit.wikimedia.org/r/259540 [22:59:39] rebase race [23:00:04] yurik: Respected human, time to deploy Graph ext (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151216T2300). Please do the needful. [23:00:35] (03PS2) 10RobH: lists.wikimedia.org certificate update [puppet] - 10https://gerrit.wikimedia.org/r/259053 (https://phabricator.wikimedia.org/T120237) [23:00:57] (03CR) 10RobH: [C: 032] lists.wikimedia.org certificate update [puppet] - 10https://gerrit.wikimedia.org/r/259053 (https://phabricator.wikimedia.org/T120237) (owner: 10RobH) [23:01:37] grrrit-wm: nah.. nah? 
[23:01:51] (03PS3) 10RobH: lists.wikimedia.org certificate update [puppet] - 10https://gerrit.wikimedia.org/r/259053 (https://phabricator.wikimedia.org/T120237) [23:02:11] (03CR) 10Ori.livneh: [C: 031] ores: Stop using aof for redis persistance [puppet] - 10https://gerrit.wikimedia.org/r/259593 (https://phabricator.wikimedia.org/T121658) (owner: 10Yuvipanda) [23:07:27] (03PS67) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [23:07:37] (03PS14) 10MaxSem: OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) [23:07:50] !log fermium returned to normal service with new/renewed ssl cert for lists.w.o [23:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:52] !log running foreachwiki extensions/CentralAuth/maintenance/checkLocalUser.php --verbose=1 --delete=1 (T119736) [23:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:12:57] (03PS1) 10Aaron Schulz: Set $wgCentralAuthUseSlaves in betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259611 [23:13:47] (03CR) 10Andrew Bogott: Add ferm to puppet::self::master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/259608 (owner: 10BryanDavis) [23:15:55] (03CR) 10BryanDavis: [C: 04-1] Add ferm to puppet::self::master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/259608 (owner: 10BryanDavis) [23:19:06] (03PS2) 10BryanDavis: Add ferm to puppet::self::master [puppet] - 10https://gerrit.wikimedia.org/r/259608 [23:20:22] PROBLEM - puppet last run on mw1153 is CRITICAL: CRITICAL: Puppet has 1 failures [23:22:57] (03CR) 10Dzahn: Add ferm to puppet::self::master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/259608 (owner: 10BryanDavis) [23:23:12] <% if @server_type == 'standalone' or @server_type == 'frontend' -%> 
[23:23:18] Listen 8140 [23:23:24] <% if @server_type == 'backend' or @server_type == 'frontend' -%> [23:23:29] Listen 8141 [23:23:56] bd808: andrewbogott ^ 8140 is standard but we added balancing [23:24:10] so 8141 is when it's setup to do 2 things, like here: [23:24:27] https://wikitech.wikimedia.org/w/images/4/4c/Puppetquery.png [23:24:32] mutante: I think this class would be setting server_type = standalone [23:24:35] like palladium shows up twice there [23:24:51] * bd808 tries to confirm that [23:25:02] yea, 8140 would be the regular puppetmaster port [23:26:09] * bd808 is now lost in a maze of twisty little classes that YuviPanda hasn't hunted down and killed yet [23:27:15] bd808: i admit that was probably the worst timing to ask for it since he is in the middle of doing that :p [23:27:56] !log yurik@tin Synchronized php-1.27.0-wmf.9/extensions/Graph/: Deployed Graph ext to master - protocol issue (duration: 00m 32s) [23:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:28:08] sleep well folks [23:28:53] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [23:29:22] mutante: this ::puppet::self::* stuff is an island unto itself and seems not to mess with any port configuration or use the template you referenced. [23:29:31] (03Abandoned) 10Dzahn: labtest: don't send SMS for test machines [puppet] - 10https://gerrit.wikimedia.org/r/259073 (https://phabricator.wikimedia.org/T120047) (owner: 10Dzahn) [23:29:44] so 8140 should cover it [23:30:05] bd808: yes, that makes sense, and it would have been surprising if it was different [23:30:21] since that is just for having the multi-master setup [23:30:25] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, and 2 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1886362 (10dr0ptp4kt) @ori, is that referring to the following? 
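The conclusion that "8140 should cover it" maps onto a single ferm rule of the kind ops/puppet already declares elsewhere. A display-only sketch of the general shape (the resource title here is illustrative; the actual contents are whatever the patch under review, gerrit 259608, settles on):

```shell
# Display-only: print the rough shape of a ferm::service resource opening
# the puppetmaster port. Not the literal patch content.
cat <<'EOF'
ferm::service { 'puppetmaster_self':
    proto => 'tcp',
    port  => '8140',
}
EOF
```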
``` dns/templates $ grep -ir 'LVS Misc' * ./0.6.... [23:30:40] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 6Zero, and 3 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1886364 (10dr0ptp4kt) [23:32:02] bd808: i think it's good, just should be in a different location [23:32:10] because "Ensure that there are no firewall rules in modules" [23:32:18] !log ori@tin Synchronized php-1.27.0-wmf.8/includes/api/ApiStashEdit.php: local hack some extra debug logging into ApiStashEdit (duration: 00m 30s) [23:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:32:27] https://phabricator.wikimedia.org/T114209 [23:33:48] (03CR) 10Dzahn: "the rule and port look right, it should just be in a role class instead of here, due to https://phabricator.wikimedia.org/T114209" [puppet] - 10https://gerrit.wikimedia.org/r/259608 (owner: 10BryanDavis) [23:37:21] (03PS3) 10BryanDavis: Add ferm to role::puppet::self [puppet] - 10https://gerrit.wikimedia.org/r/259608 [23:37:56] !log ori@tin Synchronized php-1.27.0-wmf.8/includes/api/ApiStashEdit.php: local hack some extra debug logging into ApiStashEdit (take 2) (duration: 00m 30s) [23:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:38:09] (03CR) 10Andrew Bogott: [C: 031] Add ferm to role::puppet::self [puppet] - 10https://gerrit.wikimedia.org/r/259608 (owner: 10BryanDavis) [23:40:07] 6operations, 10MediaWiki-General-or-Unknown, 7Graphite, 5MW-1.27-release-notes, 5Patch-For-Review: mediawiki should send statsd metrics in batches - https://phabricator.wikimedia.org/T116031#1886377 (10hashar) jobrunner is a different system though: mediawiki/services/jobrunner.git [23:46:12] RECOVERY - puppet last run on mw1153 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [23:55:12] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data 
above the critical threshold [1000.0] [23:59:12] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]