[00:00:37] operations, Wikimedia-Mailing-lists, Patch-For-Review: clean up mailman data directory (moderated messages > 0.5 million) - https://phabricator.wikimedia.org/T109838#1569714 (Dzahn) down to 103051 and closing this for now
[00:00:42] operations, Wikimedia-Mailing-lists, Patch-For-Review: clean up mailman data directory (moderated messages > 0.5 million) - https://phabricator.wikimedia.org/T109838#1569715 (Dzahn) Open>Resolved
[00:00:43] operations, Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1569716 (Dzahn)
[00:10:06] operations, Beta-Cluster, Database: Possible to run writes (e.g. UPDATE) on slave - https://phabricator.wikimedia.org/T110115#1569721 (Mattflaschen)
[00:14:40] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK Less than 1.00% above the threshold [1000000.0]
[00:17:11] operations, Wikimedia-Mailing-lists, User-Bd808-Test: Close mwapi-team@lists.wikimedia.org list - https://phabricator.wikimedia.org/T97148#1569734 (Dzahn) archives are at https://lists.wikimedia.org/mailman/private/mwapi-team.disabled.t97148/
[00:20:24] operations, Wikimedia-Mailing-lists: close and delete the flowfunding mailing list - https://phabricator.wikimedia.org/T97328#1569736 (Dzahn) >>! In T97328#1551350, @Tbayer wrote: >> https://lists.wikimedia.org/pipermail/FlowFunding >> and >> https://lists.wikimedia.org/pipermail/flowfunding.disabled.t973...
[00:21:36] (CR) Alex Monk: "Asking more about this on T109687" [puppet] - https://gerrit.wikimedia.org/r/233638 (owner: Alex Monk)
[00:22:11] operations, Wikimedia-Mailing-lists: rename lists mwapi-team.disabled.T97148 and flowfunding.disabled.T97328 ? - https://phabricator.wikimedia.org/T109539#1569750 (Dzahn) moved to the same names with "t" instead of "T". listinfo pages work in both versions archives moved https://lists.wikimedia.org/...
[00:22:20] operations, Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1569753 (Dzahn)
[00:23:22] operations, ops-codfw: mw2180 has a faulty disk - https://phabricator.wikimedia.org/T109687#1569755 (Krenair) I noticed that this host prompts for a password to login. I'm not too familiar with the stages of setup, is there some extra step needed (puppet related?) before this machine can be put back into...
[00:24:24] operations, Wikimedia-Mailing-lists, Patch-For-Review: service IP can't be switched over - https://phabricator.wikimedia.org/T108080#1569756 (Dzahn) https://gerrit.wikimedia.org/r/#/c/233052/ https://gerrit.wikimedia.org/r/#/c/233050/ merged
[00:25:22] operations, Wikimedia-Mailing-lists, Patch-For-Review: service IP can't be switched over - https://phabricator.wikimedia.org/T108080#1569768 (Dzahn) https://gerrit.wikimedia.org/r/#/c/233642/
[00:25:43] operations, Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1569772 (Dzahn)
[00:25:45] operations, Wikimedia-Mailing-lists, Patch-For-Review: service IP can't be switched over - https://phabricator.wikimedia.org/T108080#1569770 (Dzahn) Open>Resolved a:Dzahn
[00:36:01] (PS1) Ori.livneh: Lint: make ConfigDict methods lowerCamelCase [debs/pybal] - https://gerrit.wikimedia.org/r/233647
[00:42:50] (PS5) Thcipriani: Add servicedeploy user; ssh-agent-proxy changes [puppet] - https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862)
[00:46:04] (PS2) Ori.livneh: Lint: make ConfigDict methods lowerCamelCase [debs/pybal] - https://gerrit.wikimedia.org/r/233647
[00:46:06] (PS1) Ori.livneh: Migrate get_subclasses to pybal.util; add unit tests [debs/pybal] - https://gerrit.wikimedia.org/r/233649
[00:47:13] thcipriani: that looks awesome
[00:48:31] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK Less than 1.00% above the threshold [1000000.0]
[00:48:35] ori: nice! seems to work pretty well. please let me know if you have feedback :)
[00:49:31] that patch is cherry-picked on the staging project in labs right now, FYI
[00:49:53] * ori nods
[00:50:17] out of curiosity, why 'ns_header' rather than 's_header'? (that is: what is the 'n' supposed to represent?)
[00:51:15] oh, 'netstring'
[00:51:44] indeed.
[00:52:39] (CR) GWicke: Add servicedeploy user; ssh-agent-proxy changes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: Thcipriani)
[00:54:31] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 9.09% of data above the critical threshold [500.0]
[01:05:36] (CR) Ori.livneh: Add servicedeploy user; ssh-agent-proxy changes (8 comments) [puppet] - https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: Thcipriani)
[01:06:21] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[01:10:36] (CR) Ori.livneh: [C: 2] Lint: make ConfigDict methods lowerCamelCase [debs/pybal] - https://gerrit.wikimedia.org/r/233647 (owner: Ori.livneh)
[01:10:49] (CR) Ori.livneh: [C: 2] Migrate get_subclasses to pybal.util; add unit tests [debs/pybal] - https://gerrit.wikimedia.org/r/233649 (owner: Ori.livneh)
[01:11:34] (Merged) jenkins-bot: Lint: make ConfigDict methods lowerCamelCase [debs/pybal] - https://gerrit.wikimedia.org/r/233647 (owner: Ori.livneh)
[01:11:36] (Merged) jenkins-bot: Migrate get_subclasses to pybal.util; add unit tests [debs/pybal] - https://gerrit.wikimedia.org/r/233649 (owner: Ori.livneh)
[01:17:34] (CR) Ori.livneh: Add servicedeploy user; ssh-agent-proxy changes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: Thcipriani)
[01:23:19] (CR) Ori.livneh: "@Thcipriani: I recommend you split this patch into (at least) two separate patches. Make the changes to ssh-agent-proxy / keyholder first," [puppet] - https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: Thcipriani)
[01:32:11] PROBLEM - Disk space on mw1142 is CRITICAL: DISK CRITICAL - free space: / 8178 MB (3% inode=93%)
[01:43:57] !log restarting kafka on new brokers kafka1013,1014,1020 to apply increase in num.replica.fetchers
[01:43:58] to make new brokers copy new data faster
[01:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:48:47] !log starting move of kafka partitions for topic webrequest_upload to new brokers. this will take a while!
[01:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:01:27] Ops-Access-Requests, operations, Patch-For-Review: Replace SSH key for mholloway - https://phabricator.wikimedia.org/T110064#1569904 (Mholloway) I managed to create a new Phab user account associated with my WMF wiki account, then about 30 seconds later find the setting I should have used to associate...
[02:05:02] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-tools/snapshot is not accessible: Permission denied
[02:15:00] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 13.33% of data above the critical threshold [500.0]
[02:21:37] !log l10nupdate@tin Synchronized php-1.26wmf19/cache/l10n: l10nupdate for 1.26wmf19 (duration: 06m 26s)
[02:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:26:42] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[02:44:19] (PS2) Tim Landscheidt: WIP: Add BigBrotherMonitor [software/tools-manifest] - https://gerrit.wikimedia.org/r/233338
[02:44:35] (CR) jenkins-bot: [V: -1] WIP: Add BigBrotherMonitor [software/tools-manifest] - https://gerrit.wikimedia.org/r/233338 (owner: Tim Landscheidt)
[02:45:01] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures
[02:56:13] operations, Varnish, Wikimedia-log-errors: upload.wikimedia.org returns HTTP status code 501 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517#1569938 (Tgr) [[ https://github.com/wikimedia/mediawiki/blob/master/thumb.php | `thumb.php` ]] does not ever return 501 as far as I can see...
[02:58:01] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0]
[03:05:52] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[03:07:10] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[03:09:00] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10847 bytes in 0.205 second response time
[03:12:30] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:21:37] operations, HTTPS: download.wiki[mp]edia.org are using an invalid certificate - https://phabricator.wikimedia.org/T107575#1569964 (Chmarkine)
[03:39:29] (PS1) Chad: Point download.wiki(m|p)edia.org at text-lb [dns] - https://gerrit.wikimedia.org/r/233659 (https://phabricator.wikimedia.org/T107575)
[03:39:31] (PS1) Chad: Rewrite download.wiki(p|m)edia.org urls to dumps.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/233658 (https://phabricator.wikimedia.org/T107575)
[03:45:31] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures
[03:57:50] PROBLEM - Disk space on rdb1003 is CRITICAL: DISK CRITICAL - free space: / 8532 MB (3% inode=99%)
[03:58:02] operations, Traffic, HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#1569992 (MZMcBride) >>! In T103919#1402421, @Dzahn wrote: > left after removing svn and dev: > > git.wikimedia.org > graphite.wikimedia.org > releases.wikimedia.org > graf...
[04:12:20] PROBLEM - puppet last run on mw2025 is CRITICAL puppet fail
[04:12:51] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:12:51] (PS1) MZMcBride: Remove auto-redirection from 404 page. [mediawiki-config] - https://gerrit.wikimedia.org/r/233664 (https://phabricator.wikimedia.org/T37052)
[04:12:53] (CR) jenkins-bot: [V: -1] Remove auto-redirection from 404 page. [mediawiki-config] - https://gerrit.wikimedia.org/r/233664 (https://phabricator.wikimedia.org/T37052) (owner: MZMcBride)
[04:19:00] PROBLEM - puppet last run on db2045 is CRITICAL Puppet has 1 failures
[04:19:30] PROBLEM - Redis on rdb1003 is CRITICAL: Connection timed out
[04:25:10] (PS1) Legoktm: Use CodeEditor for HTML templates on Meta-Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/233665 (https://phabricator.wikimedia.org/T105625)
[04:26:18] (PS2) Legoktm: Use CodeEditor for HTML templates on Meta-Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/233665 (https://phabricator.wikimedia.org/T110151)
[04:26:23] (CR) jenkins-bot: [V: -1] Use CodeEditor for HTML templates on Meta-Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/233665 (https://phabricator.wikimedia.org/T110151) (owner: Legoktm)
[04:27:52] (CR) Legoktm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/233665 (https://phabricator.wikimedia.org/T110151) (owner: Legoktm)
[04:39:01] RECOVERY - Redis on rdb1003 is OK: TCP OK - 3.008 second response time on port 6379
[04:41:01] (CR) MZMcBride: Use CodeEditor for HTML templates on Meta-Wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/233665 (https://phabricator.wikimedia.org/T110151) (owner: Legoktm)
[04:41:31] RECOVERY - puppet last run on mw2025 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:42:52] (PS3) Legoktm: Use CodeEditor for HTML templates on Meta-Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/233665 (https://phabricator.wikimedia.org/T110151)
[04:43:03] (CR) Legoktm: Use CodeEditor for HTML templates on Meta-Wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/233665 (https://phabricator.wikimedia.org/T110151) (owner: Legoktm)
[04:46:00] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures
[04:46:12] RECOVERY - puppet last run on db2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:48:50] PROBLEM - Redis on rdb1003 is CRITICAL: Connection timed out
[04:54:30] RECOVERY - Redis on rdb1003 is OK: TCP OK - 3.002 second response time on port 6379
[04:59:55] operations, Traffic, HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#1570062 (Chmarkine) According to DNS, download.wikimedia.org and gerrit.wikimedia.org are not behind misc-web. Why are these two domains in misc.inc.vcl.erb?
[05:06:30] PROBLEM - Redis on rdb1003 is CRITICAL: Connection timed out
[05:08:21] RECOVERY - Redis on rdb1003 is OK: TCP OK - 3.002 second response time on port 6379
[05:22:00] PROBLEM - Redis on rdb1003 is CRITICAL: Connection timed out
[05:23:22] operations, Wikimedia-Mailing-lists: close and delete the flowfunding mailing list - https://phabricator.wikimedia.org/T97328#1570071 (Tbayer) >>! In T97328#1569736, @Dzahn wrote: >>>! In T97328#1551350, @Tbayer wrote: >>> https://lists.wikimedia.org/pipermail/FlowFunding >>> and >>> https://lists.wikimed...
[05:27:41] RECOVERY - Redis on rdb1003 is OK: TCP OK - 0.999 second response time on port 6379
[05:33:50] PROBLEM - Redis on rdb1003 is CRITICAL: Connection timed out
[05:42:23] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:43:37] (CR) Muehlenhoff: [C: 1] "I've never run into the masking bug behaviour with /etc/systemd/system myself, but that makes sense; /lib/systemd/system is the path for d" [puppet] - https://gerrit.wikimedia.org/r/233626 (owner: Alexandros Kosiaris)
[05:44:34] <_joe_> what's up with redis?
[05:45:12] RECOVERY - Redis on rdb1003 is OK: TCP OK - 0.003 second response time on port 6379
[05:45:21] <_joe_> uhm
[05:51:10] PROBLEM - Redis on rdb1003 is CRITICAL: Connection timed out
[05:58:51] RECOVERY - Redis on rdb1003 is OK: TCP OK - 0.003 second response time on port 6379
[06:05:30] (PS1) Muehlenhoff: Enable ferm on db1048.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/233670
[06:14:14] (PS1) Muehlenhoff: Enable ferm on remaining phabricator db hosts [puppet] - https://gerrit.wikimedia.org/r/233671
[06:22:59] (PS1) Ori.livneh: Initial commit of ConfigurationObserver unit tests [debs/pybal] - https://gerrit.wikimedia.org/r/233672
[06:24:31] PROBLEM - Redis on rdb1003 is CRITICAL: Connection timed out
[06:29:52] PROBLEM - puppet last run on eventlog2001 is CRITICAL puppet fail
[06:30:31] PROBLEM - puppet last run on mw2077 is CRITICAL puppet fail
[06:30:51] PROBLEM - puppet last run on db2055 is CRITICAL puppet fail
[06:31:20] PROBLEM - puppet last run on cp3048 is CRITICAL puppet fail
[06:31:31] PROBLEM - puppet last run on cp2001 is CRITICAL Puppet has 1 failures
[06:32:02] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures
[06:32:02] PROBLEM - puppet last run on mw2129 is CRITICAL Puppet has 1 failures
[06:32:31] PROBLEM - puppet last run on mw2018 is CRITICAL Puppet has 1 failures
[06:33:01] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures
[06:33:01] PROBLEM - puppet last run on mw1158 is CRITICAL Puppet has 2 failures
[06:33:21] PROBLEM - puppet last run on mw1110 is CRITICAL Puppet has 1 failures
[06:33:31] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 1 failures
[06:33:31] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 1 failures
[06:36:00] PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL - Socket timeout after 10 seconds
[06:37:51] RECOVERY - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 457 bytes in 0.004 second response time
[06:38:11] PROBLEM - Disk space on rdb1004 is CRITICAL: DISK CRITICAL - free space: / 197 MB (0% inode=99%)
[06:38:35] !log performing schema change on officewiki, mediawikiwiki and metawiki
[06:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:40:11] RECOVERY - Disk space on rdb1004 is OK: DISK OK
[06:42:20] RECOVERY - Redis on rdb1003 is OK: TCP OK - 0.002 second response time on port 6379
[06:48:20] PROBLEM - Redis on rdb1003 is CRITICAL: Connection timed out
[06:49:31] PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 {channel:frontend.error,request:{id:1440485364758-38428},error:{message:Status check failed (redis failure?)}} - 232 bytes in 0.014 second response time
[06:56:30] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:56:30] RECOVERY - puppet last run on mw1158 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:56:41] RECOVERY - puppet last run on cp3048 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:56:50] RECOVERY - puppet last run on mw1110 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:56:51] RECOVERY - puppet last run on cp2001 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:57:00] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:57:22] RECOVERY - puppet last run on eventlog2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:30] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:57:30] RECOVERY - puppet last run on mw2129 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:57:51] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:58:00] RECOVERY - puppet last run on mw2077 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:58:11] RECOVERY - puppet last run on db2055 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:20] PROBLEM - puppet last run on cp4009 is CRITICAL puppet fail
[06:58:52] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:41] PROBLEM - Disk space on rdb1002 is CRITICAL: DISK CRITICAL - free space: /a 8824 MB (3% inode=99%)
[07:03:50] RECOVERY - Redis on rdb1003 is OK: TCP OK - 0.009 second response time on port 6379
[07:11:58] <_joe_> !log stopping redis on rdb1003,4, wiping AOF, restarting
[07:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:12:30] PROBLEM - Disk space on rdb1001 is CRITICAL: DISK CRITICAL - free space: /a 8070 MB (2% inode=99%)
[07:14:42] RECOVERY - Disk space on rdb1003 is OK: DISK OK
[07:23:00] PROBLEM - Disk space on rdb1002 is CRITICAL: DISK CRITICAL - free space: /a 9336 MB (3% inode=99%)
[07:24:30] RECOVERY - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.008 second response time
[07:25:32] RECOVERY - puppet last run on cp4009 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[07:33:10] PROBLEM - Disk space on rdb1002 is CRITICAL: DISK CRITICAL - free space: /a 1747 MB (0% inode=99%)
[07:35:01] PROBLEM - Redis on rdb1001 is CRITICAL: Connection refused
[07:35:02] RECOVERY - Disk space on rdb1002 is OK: DISK OK
[07:35:13] <_joe_> !log stopping redis, wiping aof, restarting redis on rdb100{1,2} - snapshot saved on rdb1002:/root
[07:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:36:11] RECOVERY - Disk space on rdb1001 is OK: DISK OK
[07:37:01] RECOVERY - Redis on rdb1001 is OK: TCP OK - 0.001 second response time on port 6379
[07:44:51] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail
[07:59:20] operations, Varnish, Wikimedia-log-errors: upload.wikimedia.org returns HTTP status code 501 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517#1570340 (Joe) @tgr yes this problem is at the varnish level.
[08:12:31] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures
[08:20:01] PROBLEM - OCG health on ocg1001 is CRITICAL ocg_job_status 17003 msg: ocg_render_job_queue 4931 msg (=3000 critical)
[08:20:41] PROBLEM - OCG health on ocg1003 is CRITICAL ocg_job_status 18184 msg: ocg_render_job_queue 5458 msg (=3000 critical)
[08:21:31] PROBLEM - OCG health on ocg1002 is CRITICAL ocg_job_status 19046 msg: ocg_render_job_queue 5677 msg (=3000 critical)
[08:22:21] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 3 below the confidence bounds
[08:22:32] (PS1) Muehlenhoff: Additional fixes from initial review [debs/debdeploy] - https://gerrit.wikimedia.org/r/233678
[08:23:39] (PS1) Muehlenhoff: Bump version for rebuild [debs/debdeploy] - https://gerrit.wikimedia.org/r/233679
[08:24:34] (CR) Muehlenhoff: [C: 2 V: 2] Additional fixes from initial review [debs/debdeploy] - https://gerrit.wikimedia.org/r/233678 (owner: Muehlenhoff)
[08:24:45] <_joe_> dcausse: around?
[08:24:51] (CR) Muehlenhoff: [C: 2 V: 2] Bump version for rebuild [debs/debdeploy] - https://gerrit.wikimedia.org/r/233679 (owner: Muehlenhoff)
[08:24:58] _joe_: yes
[08:25:17] <_joe_> dcausse: did you guys launch some large reindexing jobs yesterday for elasticsearch?
[08:25:44] _joe_: I guess Chase and Erik did some tests
[08:26:09] <_joe_> oh well, their tests went very good, we had a full blown outage :)
[08:26:22] damn
[08:26:58] <_joe_> also, if that was !log-ged I would've had an easier life purging those
[08:27:24] maybe it's not related to their, what's happened?
[08:27:33] s/to their/to their test/
[08:28:19] <_joe_> no I'm pretty sure it is
[08:28:38] <_joe_> we had a ginormous surge in the amount of jobs the jobqueue were processing
[08:28:56] <_joe_> a number much larger than what the jobqueue could process
[08:29:02] <_joe_> so the data piled up in redis
[08:29:09] <_joe_> and redis eventually died badly
[08:29:11] oh ok
[08:29:35] index was frozen and maybe forgot to unfreeze
[08:30:18] another check to monitor?
[08:30:40] I guess we need to check the cirrus queue size
[08:30:45] <_joe_> dcausse: can you check on your side? we're killing those jobs I think
[08:31:24] <_joe_> http://graphite.wikimedia.org/render/?width=1887&height=960&_salt=1440490967.532&from=-1day&target=mostDeviant(4%2CMediaWiki.jobqueue.inserts.*.rate)
[08:32:46] mhh doesn't look like we're out of the woods yet, those spikes weren't there before
[08:33:15] <_joe_> godog: we must understand how to wipe those data out I guess
[08:33:33] <_joe_> I am writing the incident report, can you guys handle this please?
[08:34:08] godog: those sawlike spikes could be half working reporting
[08:34:26] the first 5 at least
[08:34:47] the 2-3 last ones that are way more erratic worry me more
[08:35:44] akosiaris: yeah you are right, the regular spikes are from when redis was down
[08:39:45] looking at recent commits there's https://gerrit.wikimedia.org/r/#/c/232868/ which *might* be related, what do you think dcausse ?
[08:40:19] if it has been already deployed that is
[08:41:46] godog: i think it's not
[08:41:58] godog: maybe... but the queue started to grow at 18h yesterday
[08:42:41] it's merged in master but not in any wmf/1.2something branches
[08:43:00] yeah that's true, nevermind not related
[08:43:31] PROBLEM - Check size of conntrack table on silver is CRITICAL nf_conntrack is 100 % full
[08:43:50] really ? silver ?
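Editor's note on the conntrack alert above: a figure like "nf_conntrack is 100 % full" can be derived from two kernel counters. The sketch below is not the actual Icinga check used here; it assumes the standard Linux files `nf_conntrack_count` and `nf_conntrack_max` under `/proc/sys/net/netfilter`, and the function names and the 80/90 % thresholds are hypothetical.

```python
from pathlib import Path

# Assumed location of the kernel counters (standard on Linux with conntrack loaded).
PROC_DIR = Path("/proc/sys/net/netfilter")


def conntrack_fill_percent(count: int, maximum: int) -> int:
    """Return how full the conntrack table is, as a whole percentage (rounded down)."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count * 100 // maximum


def classify(percent: int, warn: int = 80, crit: int = 90) -> str:
    """Map a fill percentage to a Nagios-style state (thresholds are illustrative)."""
    if percent >= crit:
        return "CRITICAL"
    if percent >= warn:
        return "WARNING"
    return "OK"


def read_counter(name: str) -> int:
    """Read one integer counter such as nf_conntrack_count."""
    return int((PROC_DIR / name).read_text().strip())


if __name__ == "__main__":
    pct = conntrack_fill_percent(
        read_counter("nf_conntrack_count"), read_counter("nf_conntrack_max")
    )
    print(f"{classify(pct)} nf_conntrack is {pct} % full")
```

Keeping the arithmetic separate from the file reads means the percentage logic can be exercised without a host that has the conntrack module loaded.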
[08:45:31] RECOVERY - Check size of conntrack table on silver is OK nf_conntrack is 4 % full
[08:45:34] and I was wrong btw, it is deployed
[08:45:51] wat? from 100 => 4 ?
[08:51:40] operations, Labs, Labs-Infrastructure: disk space on labvirt1007 - https://phabricator.wikimedia.org/T109752#1570391 (hashar) Seems some disk space has been reclaimed on labvirt1007. I managed to boot the three CI instances that were paused (T110052).
[08:57:42] at 8:40:00 /usr/local/bin/mwscript maintenance/runJobs.php --wiki=labswiki started and the nf_conntrack table started to overflow at 8:41:02, so that's not a false positive
[08:58:57] moritzm: interesting that jobs caused that while the jobqueue is having problems
[08:59:15] also happened yesterday at 18:59, but not for earlier jobs
[08:59:23] I have no idea if wikitech shares infrastructure though with the rest of the infra
[09:02:06] <_joe_> it does
[09:02:09] <_joe_> it's in the train now
[09:02:24] <_joe_> a brilliant choice I opposed to a lot
[09:02:38] <_joe_> but clearly laziness >>> correctness in our team
[09:03:11] <_joe_> "it's easier to keep it in sync" vs "any problem to the rest of the infra is going to cripple wikitech as well"
[09:03:24] <_joe_> with the justification being "we have wikitech-static anyways"
[09:04:39] _joe_: well, it's more fine-grained than that
[09:04:47] it has its own memcached for example
[09:05:12] but not its own redis
[09:05:22] it's worse than what I thought
[09:06:09] silver had some firewall issues last week
[09:06:56] operations, Wikimedia-Mailing-lists: send follow-up email, announce changes with new mailman version if any that have user impact - https://phabricator.wikimedia.org/T110140#1570434 (JohnLewis) Will review user impacting changes via the change logs but I don't think there are really.
[09:07:37] so it looks like indices were frozen yesterday at 17h58: jobs started to be rejected and delayed, I ran the unfreeze command: jobs are being accepted again
[09:07:59] <_joe_> ahahahahah
[09:08:09] <_joe_> dcausse: who froze the indices?
[09:09:17] Chase I think, I'll check with this afternoon
[09:09:27] <_joe_> ok
[09:09:48] <_joe_> dcausse: I'll need to amend the incident docs, but great catch
[09:10:12] dcausse: thanks!
[09:10:17] nice going
[09:10:42] <_joe_> dcausse: yes, thanks a lot :)
[09:10:46] indeed thanks dcausse !
[09:11:28] this freezeIndex command is nice but we should make sure that indices are frozen more than 2hours
[09:11:37] s/are/are
[09:11:40] s/are/are *not*/
[09:12:03] <_joe_> dcausse: yeah another actionable for my documentation :)
[09:12:38] operations, Monitoring: Monitor redis memory/disk usage - https://phabricator.wikimedia.org/T110169#1570438 (Joe) NEW
[09:14:32] and I'm afraid we've lost some updates :( we'll have to reindex some docs manually (all updates from yesterday 18h to now)
[09:16:00] <_joe_> dcausse: when did you exactly restart indexing?
[09:16:21] <_joe_> at 09:07 UTC?
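Editor's note: the remark above that indices should not stay frozen for more than about two hours could be turned into a check along these lines. This is only a sketch under stated assumptions: `frozen_since` is a hypothetical input (how CirrusSearch records a freeze is not shown in this log), and the two-hour limit simply echoes the figure mentioned in the channel.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative limit, taken from the "2hours" figure mentioned in the channel.
MAX_FROZEN = timedelta(hours=2)


def frozen_too_long(
    frozen_since: Optional[datetime],
    now: datetime,
    limit: timedelta = MAX_FROZEN,
) -> bool:
    """Return True if an index has been frozen for longer than `limit`.

    `frozen_since` is None when the index is not frozen (hypothetical input).
    """
    if frozen_since is None:
        return False
    return now - frozen_since > limit


if __name__ == "__main__":
    # Times quoted in the log: frozen around 17h58, writes allowed again at 08:37:48 UTC.
    frozen = datetime(2015, 8, 24, 17, 58, tzinfo=timezone.utc)
    unfrozen = datetime(2015, 8, 25, 8, 37, 48, tzinfo=timezone.utc)
    print(frozen_too_long(frozen, unfrozen))
```

Passing `now` explicitly rather than calling the clock inside the function keeps the check deterministic, and the `limit` parameter lets a caller use a shorter window without touching the logic.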
[09:17:06] (CR) Filippo Giunchedi: [C: -1] Assign swift roles via ENC (2 comments) [puppet] - https://gerrit.wikimedia.org/r/200625 (https://phabricator.wikimedia.org/T91553) (owner: Thcipriani)
[09:17:14] 8h37 UTC
[09:17:16] 2015-08-25 08:37:48 mw1012 frwiki CirrusSearch DEBUG: Allowed writes to frwiki_general
[09:17:38] <_joe_> dcausse: ok thanks
[09:17:54] <_joe_> dcausse: when you do such things, !log them
[09:18:50] _joe_: ok
[09:19:19] operations, Monitoring: Alert when ES indexes are freezed for more than 30 minutes - https://phabricator.wikimedia.org/T110171#1570456 (Joe) NEW
[09:20:41] (CR) Filippo Giunchedi: [C: 1] Add hiera data for swift proxies and backends [puppet] - https://gerrit.wikimedia.org/r/233443 (https://phabricator.wikimedia.org/T104965) (owner: Muehlenhoff)
[09:21:42] (CR) Filippo Giunchedi: [C: 2 V: 2] Require openjdk-8-jdk [puppet] - https://gerrit.wikimedia.org/r/222037 (owner: GWicke)
[09:23:33] (PS2) Muehlenhoff: Add hiera data for swift proxies and backends [puppet] - https://gerrit.wikimedia.org/r/233443 (https://phabricator.wikimedia.org/T104965)
[09:23:45] (CR) Muehlenhoff: [C: 2 V: 2] Add hiera data for swift proxies and backends [puppet] - https://gerrit.wikimedia.org/r/233443 (https://phabricator.wikimedia.org/T104965) (owner: Muehlenhoff)
[09:32:09] (PS1) Muehlenhoff: Bump version in changelog [debs/debdeploy] - https://gerrit.wikimedia.org/r/233682
[09:34:22] (CR) Muehlenhoff: [C: 2 V: 2] Bump version in changelog [debs/debdeploy] - https://gerrit.wikimedia.org/r/233682 (owner: Muehlenhoff)
[09:39:43] ACKNOWLEDGEMENT - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied Yuvi Panda False alarm, its a snapshot that our nrpe user cant access
[09:46:51] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures
[09:49:42] operations, ops-codfw: ms-be2009 - RAID degraded / failed disk - https://phabricator.wikimedia.org/T107877#1570519 (fgiunchedi) Open>Resolved complete `/dev/sdf1 1.9T 1.1T 827G 56% /srv/swift-storage/sdf1`
[09:56:05] (PS1) ArielGlenn: dumps: job 'createdir' which just creates new dir for dump run [dumps] (ariel) - https://gerrit.wikimedia.org/r/233683
[09:59:20] !log uploaded debdeploy 0.0.2-2 for precise/trusty/jessie to carbon
[09:59:25] (CR) Giuseppe Lavagetto: [C: -1] "LGTM, but we should also remove the existing units for now, right?" [puppet] - https://gerrit.wikimedia.org/r/233626 (owner: Alexandros Kosiaris)
[10:00:04] operations, OCG-General-or-Unknown: Ferm rules for ocg hosts - https://phabricator.wikimedia.org/T104976#1570549 (fgiunchedi) Resolved>Open reopening as we should be `NOTRACK`ing connections made to port 8000 on the ocg service ip, at the moment there's been a lot of jobs enqueued to ocg and the con...
[10:00:58] (PS2) ArielGlenn: dumps: job 'createdir' which just creates new dir for dump run [dumps] (ariel) - https://gerrit.wikimedia.org/r/233683
[10:02:16] (PS2) Yuvipanda: Labs: Allow per-host Hiera overrides via wikitech [puppet] - https://gerrit.wikimedia.org/r/233184 (https://phabricator.wikimedia.org/T104202) (owner: Tim Landscheidt)
[10:02:27] (CR) ArielGlenn: [C: 2 V: 2] dumps: job 'createdir' which just creates new dir for dump run [dumps] (ariel) - https://gerrit.wikimedia.org/r/233683 (owner: ArielGlenn)
[10:02:29] (CR) Yuvipanda: [C: 2 V: 2] "\o/" [puppet] - https://gerrit.wikimedia.org/r/233184 (https://phabricator.wikimedia.org/T104202) (owner: Tim Landscheidt)
[10:05:39] 10.20 < icinga-wm> PROBLEM - OCG health on ocg1001 is CRITICAL ocg_job_status 17003 msg: ocg_render_job_queue 4931 msg (=3000 critical)
[10:05:41] !log restart puppetmaster on labcontrol1001 for https://gerrit.wikimedia.org/r/#/c/233184/
[10:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:05:47] AFAICS this was never followed by a RECOVERY
[10:10:51] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures
[10:11:07] <_joe_> Nemo_bis: what wasn't?
[10:11:50] Nemo_bis: yes, it has not recovered yet. it is processing a backlog
[10:12:17] (CR) Hashar: [C: 2] "I am not maintaining that repository and I don't think jenkins is able to submit patch there." [dumps] - https://gerrit.wikimedia.org/r/207699 (owner: Dereckson)
[10:13:07] (CR) Alexandros Kosiaris: "indeed. A temporary file resource ensuring absent should be enough I think. I 'll amend" [puppet] - https://gerrit.wikimedia.org/r/233626 (owner: Alexandros Kosiaris)
[10:13:31] (CR) Hashar: "Yup Jenkins can't merge, so Ariel needs to submit the patch." [dumps] - https://gerrit.wikimedia.org/r/207699 (owner: Dereckson)
[10:16:11] Ops-Access-Requests, operations, Mobile-Content-Service, Patch-For-Review: Replace dbrant with mholloway for MobileApps production access - https://phabricator.wikimedia.org/T109857#1570586 (akosiaris) Approved in T109855. merging change
[10:16:40] (PS3) Alexandros Kosiaris: Replace dbrant with mholloway for mobileapps prod access [puppet] - https://gerrit.wikimedia.org/r/233364 (https://phabricator.wikimedia.org/T109857) (owner: Muehlenhoff)
[10:16:49] (CR) Alexandros Kosiaris: [C: 2 V: 2] Replace dbrant with mholloway for mobileapps prod access [puppet] - https://gerrit.wikimedia.org/r/233364 (https://phabricator.wikimedia.org/T109857) (owner: Muehlenhoff)
[10:16:55] (PS4) Alexandros Kosiaris: Replace dbrant with mholloway for mobileapps prod access [puppet] - https://gerrit.wikimedia.org/r/233364 (https://phabricator.wikimedia.org/T109857) (owner: Muehlenhoff)
[10:17:00] (CR) Alexandros Kosiaris: [V: 2] Replace dbrant with mholloway for mobileapps prod access [puppet] - https://gerrit.wikimedia.org/r/233364 (https://phabricator.wikimedia.org/T109857) (owner: Muehlenhoff)
[10:17:40] Ops-Access-Requests, operations, Mobile-Content-Service, Patch-For-Review: Replace dbrant with mholloway for MobileApps production access - https://phabricator.wikimedia.org/T109857#1570592 (akosiaris) Open>Resolved Resolving
[10:18:49] (PS1) Muehlenhoff: Exempt ocg service from connection tracking [puppet] - https://gerrit.wikimedia.org/r/233684 (https://phabricator.wikimedia.org/T104976)
[10:21:28] (PS1) Alexandros Kosiaris: Grant access to tin to bsitzmann and mholloway [puppet] - https://gerrit.wikimedia.org/r/233685 (https://phabricator.wikimedia.org/T109855)
[10:23:29] akosiaris: and how long is it supposed to take? My tab has been refreshing for 20 minutes now :)
[10:23:43] (CR) Alexandros Kosiaris: [C: 2] Grant access to tin to bsitzmann and mholloway [puppet] - https://gerrit.wikimedia.org/r/233685 (https://phabricator.wikimedia.org/T109855) (owner: Alexandros Kosiaris)
[10:23:57] (Most people give up after a few seconds, do we drop those jobs?)
[10:23:57] Nemo_bis: a long time...
[10:24:11] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [10:24:11] (03CR) 10Filippo Giunchedi: [C: 031] Exempt ocg service from connection tracking [puppet] - 10https://gerrit.wikimedia.org/r/233684 (https://phabricator.wikimedia.org/T104976) (owner: 10Muehlenhoff) [10:25:03] Nemo_bis: it's catching up with queue problems experienced for some 2 hours, it's gonna take a while [10:25:41] Ok, less than two hours it seems https://ganglia.wikimedia.org/latest/graph.php?r=2hr&z=xlarge&c=PDF+servers+eqiad&h=ocg1001.eqiad.wmnet&jr=&js=&event=hide&ts=0&v=58208&m=ocg_job_queue&vl=messages&ti=ocg_job_queue [10:26:06] I was already fearing weeks or months ;) [10:28:02] PROBLEM - Check size of conntrack table on ocg1001 is CRITICAL nf_conntrack is 94 % full [10:28:15] Nemo_bis: my projection is closer to 3 but I could be wrong [10:32:09] 10Ops-Access-Requests, 6operations, 3Mobile-Content-Service, 5Patch-For-Review: Add bsitzmann and mholloway as deployers for the MobileApps service - https://phabricator.wikimedia.org/T109855#1570607 (10akosiaris) 5Open>3Resolved Change merged, accounts created, resolving [10:33:31] (03PS2) 10Muehlenhoff: Exempt ocg service from connection tracking [puppet] - 10https://gerrit.wikimedia.org/r/233684 (https://phabricator.wikimedia.org/T104976) [10:33:40] (03CR) 10Muehlenhoff: [C: 032 V: 032] Exempt ocg service from connection tracking [puppet] - 10https://gerrit.wikimedia.org/r/233684 (https://phabricator.wikimedia.org/T104976) (owner: 10Muehlenhoff) [10:35:47] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Replace SSH key for mholloway - https://phabricator.wikimedia.org/T110064#1570612 (10akosiaris) @chasemp, @mmodell, any pointers on how to resolve the above ? Thanks! 
[10:36:11] RECOVERY - Check size of conntrack table on ocg1001 is OK nf_conntrack is 75 % full [10:46:38] !log dropping old tables on s2 - T54932 [10:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:47:49] * jynus opens 40 different monitoring systems [10:48:26] <_joe_> just 40? [10:49:08] yes, only the essential ones [10:49:15] <_joe_> Nemo_bis: let me know when your book is ready :) [10:56:20] (03PS1) 10Muehlenhoff: Add ferm rules for swift storage backends [puppet] - 10https://gerrit.wikimedia.org/r/233686 (https://phabricator.wikimedia.org/T104965) [11:01:49] (03PS1) 10Muehlenhoff: Add ferm rules for swift proxies [puppet] - 10https://gerrit.wikimedia.org/r/233687 (https://phabricator.wikimedia.org/T104965) [11:01:55] 6operations, 10OCG-General-or-Unknown, 5Patch-For-Review: Ferm rules for ocg hosts - https://phabricator.wikimedia.org/T104976#1570663 (10MoritzMuehlenhoff) 5Open>3Resolved Closing again. [11:03:46] (03Abandoned) 10Muehlenhoff: Add ferm rules for swift proxies [puppet] - 10https://gerrit.wikimedia.org/r/223537 (owner: 10Muehlenhoff) [11:04:03] (03Abandoned) 10Muehlenhoff: Add ferm rules for swift backends [puppet] - 10https://gerrit.wikimedia.org/r/224071 (https://phabricator.wikimedia.org/T104965) (owner: 10Muehlenhoff) [11:16:41] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [11:16:44] !log dropping old tables on s3 - T54932 [11:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:18:17] lag is surprisingly good, I though we were going to have buffer pool contention [11:19:19] but makes sense- after all they are completely unused tables [11:27:21] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Replace SSH key for mholloway - https://phabricator.wikimedia.org/T110064#1570718 (10Aklapper) >>! In T110064#1569540, @Dzahn wrote: > Could you login on phabricator with your (WMF) wiki account? 
On first login that will be associated with this Phabricat... [11:37:46] (03CR) 10Glaisher: [C: 031] "Nice" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233665 (https://phabricator.wikimedia.org/T110151) (owner: 10Legoktm) [11:42:42] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [11:53:27] _joe_: I gave up in favour of action=print loooooooong ago [11:55:28] but the queue is almost dry [11:56:51] RECOVERY - OCG health on ocg1002 is OK ocg_job_status 184346 msg: ocg_render_job_queue 466 msg [11:57:10] RECOVERY - OCG health on ocg1001 is OK ocg_job_status 184383 msg: ocg_render_job_queue 337 msg [11:58:00] RECOVERY - OCG health on ocg1003 is OK ocg_job_status 184491 msg: ocg_render_job_queue 0 msg [12:00:04] aude: Dear anthropoid, the time has come. Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150825T1200). [12:00:51] (03PS2) 10ArielGlenn: dumps: redo handling of jobs with unrun prereqs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/233417 [12:06:11] (03PS1) 10Alexandros Kosiaris: DHCP/puppet changes for fermium's public IP [puppet] - 10https://gerrit.wikimedia.org/r/233690 (https://phabricator.wikimedia.org/T109923) [12:07:23] (03PS3) 10Alexandros Kosiaris: Assign fermium public IPs. IPv4 and IPv6 [dns] - 10https://gerrit.wikimedia.org/r/233414 (https://phabricator.wikimedia.org/T109923) [12:07:35] (03PS2) 10Alexandros Kosiaris: DHCP/puppet changes for fermium's public IP [puppet] - 10https://gerrit.wikimedia.org/r/233690 (https://phabricator.wikimedia.org/T109923) [12:07:57] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] DHCP/puppet changes for fermium's public IP [puppet] - 10https://gerrit.wikimedia.org/r/233690 (https://phabricator.wikimedia.org/T109923) (owner: 10Alexandros Kosiaris) [12:10:02] (03CR) 10Alexandros Kosiaris: [C: 032] Assign fermium public IPs. 
IPv4 and IPv6 [dns] - 10https://gerrit.wikimedia.org/r/233414 (https://phabricator.wikimedia.org/T109923) (owner: 10Alexandros Kosiaris) [12:11:40] 6operations, 10Wikimedia-Mailing-lists: reinstall fermium with public IP - https://phabricator.wikimedia.org/T109890#1570779 (10akosiaris) [12:11:44] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: add public IP for fermium - DNS and DHCP change for reinstall - https://phabricator.wikimedia.org/T109923#1570777 (10akosiaris) 5Open>3Resolved [12:22:28] (03PS2) 10Alexandros Kosiaris: Add Stas to statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/233375 (https://phabricator.wikimedia.org/T109357) (owner: 10Muehlenhoff) [12:22:35] (03PS3) 10Alexandros Kosiaris: Add Stas to statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/233375 (https://phabricator.wikimedia.org/T109357) (owner: 10Muehlenhoff) [12:22:59] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Add Stas to statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/233375 (https://phabricator.wikimedia.org/T109357) (owner: 10Muehlenhoff) [12:23:56] (03PS2) 10Alexandros Kosiaris: Add Erik Bernhardson to statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/233376 (https://phabricator.wikimedia.org/T109356) (owner: 10Muehlenhoff) [12:24:02] (03CR) 10jenkins-bot: [V: 04-1] Add Erik Bernhardson to statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/233376 (https://phabricator.wikimedia.org/T109356) (owner: 10Muehlenhoff) [12:26:07] (03PS3) 10ArielGlenn: dumps: redo handling of jobs with unrun prereqs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/233417 [12:28:57] (03CR) 10ArielGlenn: [C: 032] dumps: redo handling of jobs with unrun prereqs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/233417 (owner: 10ArielGlenn) [12:29:07] (03CR) 10ArielGlenn: [V: 032] dumps: redo handling of jobs with unrun prereqs [dumps] (ariel) - 
10https://gerrit.wikimedia.org/r/233417 (owner: 10ArielGlenn) [12:30:25] (03CR) 10Filippo Giunchedi: [C: 04-1] Add ferm rules for swift storage backends (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/233686 (https://phabricator.wikimedia.org/T104965) (owner: 10Muehlenhoff) [12:32:44] (03CR) 10Filippo Giunchedi: [C: 031] Add ferm rules for swift proxies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/233687 (https://phabricator.wikimedia.org/T104965) (owner: 10Muehlenhoff) [12:33:50] (03PS3) 10Alexandros Kosiaris: Add Erik Bernhardson to statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/233376 (https://phabricator.wikimedia.org/T109356) (owner: 10Muehlenhoff) [12:37:08] (03CR) 10Alexandros Kosiaris: [C: 032] Add Erik Bernhardson to statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/233376 (https://phabricator.wikimedia.org/T109356) (owner: 10Muehlenhoff) [12:37:17] (03PS2) 10ArielGlenn: dumps: tweak stages a bit [puppet] - 10https://gerrit.wikimedia.org/r/233418 [12:40:40] 6operations, 7Database: m1-master switch from db1001 to db1016 - https://phabricator.wikimedia.org/T106312#1570840 (10akosiaris) p:5Triage>3Normal We already recently tried that for etherpad with @jcrespo. It failed due to db1016 not having the same rights as db1001. On the next effort everything went smoo... [12:41:17] 6operations, 7discovery-system: Create a conftool "agent" that overcomes confd deficiencies - https://phabricator.wikimedia.org/T107285#1570842 (10akosiaris) p:5Triage>3Low [12:42:42] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1570846 (10akosiaris) @yuvipanda: ping ? 
I am free to help this week [12:43:18] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1570847 (10akosiaris) p:5Triage>3Low [12:44:28] 6operations, 7Monitoring: Monitor redis memory/disk usage - https://phabricator.wikimedia.org/T110169#1570848 (10akosiaris) p:5Triage>3High [12:44:37] 6operations, 7Monitoring: Alert when ES indexes are freezed for more than 30 minutes - https://phabricator.wikimedia.org/T110171#1570850 (10akosiaris) p:5Triage>3High [12:47:12] (03PS2) 10Muehlenhoff: Add ferm rules for swift proxies [puppet] - 10https://gerrit.wikimedia.org/r/233687 (https://phabricator.wikimedia.org/T104965) [12:48:37] (03PS1) 10Jgreen: change IP for bismuth [dns] - 10https://gerrit.wikimedia.org/r/233693 [12:48:40] 6operations, 6Discovery, 7Elasticsearch: Update Elasticsearch for missing updates from outage on 20150825 - https://phabricator.wikimedia.org/T110179#1570857 (10chasemp) 3NEW [12:50:46] 6operations, 10Adminbot: Upload new release of adminbot for Trusty - https://phabricator.wikimedia.org/T109947#1570867 (10akosiaris) a:3Dzahn Assigning to @DZahn since he did the last update [12:50:54] 6operations, 10Adminbot: Upload new release of adminbot for Trusty - https://phabricator.wikimedia.org/T109947#1570869 (10akosiaris) p:5Triage>3Low [12:52:16] (03CR) 10Jgreen: [C: 032 V: 031] change IP for bismuth [dns] - 10https://gerrit.wikimedia.org/r/233693 (owner: 10Jgreen) [12:52:22] (03CR) 10Filippo Giunchedi: [C: 031] "I recommend stopping puppet on ms-fe1 at least before merge/apply and roll out incrementally, ditto for ms-be" [puppet] - 10https://gerrit.wikimedia.org/r/233687 (https://phabricator.wikimedia.org/T104965) (owner: 10Muehlenhoff) [12:53:03] !log authdns-update to change bismuth's IP [12:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:55:51] 6operations: Linking a bn.wikipedia.org button to G+ page. 
- https://phabricator.wikimedia.org/T109810#1570888 (10akosiaris) I am not sure which project to associate with this task. Seems like the big bucket operations project was (well AFAIC, so that it is not forgotten) chosen but we probably need to move it c... [12:56:01] 6operations: Linking a bn.wikipedia.org button to G+ page. - https://phabricator.wikimedia.org/T109810#1570889 (10akosiaris) p:5Triage>3Normal [12:56:05] 6operations, 7Database: m1-master switch from db1001 to db1016 - https://phabricator.wikimedia.org/T106312#1570890 (10jcrespo) Just a side note: the issues we experience probably were caused by actually overpassing the proxy, and doing the failover "manually", so that it only affected etherpad, and not the res... [12:56:10] 6operations, 5 Incident-20140423-Redis, 5Patch-For-Review: Enable memory overcommit for all redis hosts with persistance - https://phabricator.wikimedia.org/T91498#1570891 (10chasemp) [12:56:17] 6operations, 5 Incident-20140423-Redis, 7Monitoring: Monitor redis memory/disk usage - https://phabricator.wikimedia.org/T110169#1570894 (10chasemp) [12:56:23] 6operations, 5 Incident-20140423-Redis, 7Monitoring: Alert when ES indexes are freezed for more than 30 minutes - https://phabricator.wikimedia.org/T110171#1570896 (10chasemp) [12:56:29] 6operations, 5 Incident-20140423-Redis, 6Discovery, 7Elasticsearch: Update Elasticsearch for missing updates from outage on 20150825 - https://phabricator.wikimedia.org/T110179#1570898 (10chasemp) [13:00:47] (03CR) 10Alex Monk: [C: 031] "Ignore Jenkins" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233664 (https://phabricator.wikimedia.org/T37052) (owner: 10MZMcBride) [13:01:58] 6operations, 6Services, 5Patch-For-Review: SCA: Move logs to /srv/ - https://phabricator.wikimedia.org/T107900#1570911 (10akosiaris) a:3akosiaris [13:02:08] 6operations, 6Services, 5Patch-For-Review: SCA: Move logs to /srv/ - https://phabricator.wikimedia.org/T107900#1506785 (10akosiaris) p:5Triage>3Normal [13:02:30] 
6operations, 7Database, 7Tracking: Migrate MySQLs to use ROW-based replication (tracking) - https://phabricator.wikimedia.org/T109179#1570914 (10akosiaris) p:5Triage>3Normal [13:03:33] 6operations, 7Database, 7Tracking: Migrate MySQLs to use ROW-based replication (tracking) - https://phabricator.wikimedia.org/T109179#1570916 (10akosiaris) a:3jcrespo Assigning to @jcrespo as the man for the job [13:03:40] 6operations, 5 Incident-20140423-Redis, 6Discovery, 7Elasticsearch: Update Elasticsearch for missing updates from outage on 20150825 - https://phabricator.wikimedia.org/T110179#1570918 (10dcausse) We could run: ``` mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki $wiki --from <2015... [13:04:26] 6operations, 7Database: Reduce memory commitment on database hosts with many objects, specially s3, dbstore/research and labs - https://phabricator.wikimedia.org/T107282#1570921 (10akosiaris) a:3jcrespo [13:04:29] 6operations, 7Database, 7Tracking: Migrate MySQLs to use ROW-based replication (tracking) - https://phabricator.wikimedia.org/T109179#1570923 (10jcrespo) [13:04:33] 6operations, 7Database: Reduce memory commitment on database hosts with many objects, specially s3, dbstore/research and labs - https://phabricator.wikimedia.org/T107282#1491423 (10akosiaris) p:5Triage>3Normal [13:05:48] 6operations, 5 Incident-20140423-Redis, 6Discovery, 3Discovery-Cirrus-Sprint, 7Elasticsearch: Update Elasticsearch for missing updates from outage on 20150825 - https://phabricator.wikimedia.org/T110179#1570928 (10dcausse) a:3dcausse [13:06:49] 6operations, 6Release-Engineering, 7Database: Audit all existing code to ensure that any extension currently or previously adding blobs to ES has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388#1570937 (10akosiaris) p:5Triage>3Normal [13:09:00] 6operations, 7Database: Reduce memory commitment on database hosts with many objects, specially s3, 
dbstore/research and labs - https://phabricator.wikimedia.org/T107282#1570943 (10jcrespo) p:5Normal>3Low Low: springle disagrees (with a proper reason). S3 issues are mostly gone, and probably new hardware w... [13:14:52] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [13:16:29] (03CR) 10Glaisher: [C: 031] Remove auto-redirection from 404 page. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233664 (https://phabricator.wikimedia.org/T37052) (owner: 10MZMcBride) [13:16:40] (03CR) 10Glaisher: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233664 (https://phabricator.wikimedia.org/T37052) (owner: 10MZMcBride) [13:17:56] (03CR) 10Alex Monk: [C: 031] Use CodeEditor for HTML templates on Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233665 (https://phabricator.wikimedia.org/T110151) (owner: 10Legoktm) [13:21:36] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant SMalyshev access to stat1002 to query hive - https://phabricator.wikimedia.org/T109357#1570976 (10akosiaris) 5Open>3Resolved [13:21:40] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant ebernhardson access to stat1002 to query hive - https://phabricator.wikimedia.org/T109356#1570979 (10akosiaris) 5Open>3Resolved [13:22:42] 10Ops-Access-Requests, 6operations: Add Matanya to "restricted" to perform server side uploads - https://phabricator.wikimedia.org/T106447#1570985 (10akosiaris) @mark, ping ? [13:26:07] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1571000 (10yuvipanda) So, the ultimate goal is to have: # Real hardware that we can give arbitrary people root to... # which is in the same network as labs instances, so it... 
[13:26:19] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Replace SSH key for mholloway - https://phabricator.wikimedia.org/T110064#1571002 (10akosiaris) p:5Triage>3High [13:27:02] RECOVERY - Disk space on mw1142 is OK: DISK OK [13:28:05] (03CR) 10Muehlenhoff: Add ferm rules for swift storage backends (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/233686 (https://phabricator.wikimedia.org/T104965) (owner: 10Muehlenhoff) [13:34:54] (03CR) 10Merlijn van Deen: "The correct fix is removing the package from the base image, rather than force-removing it with a puppet manifest." [puppet] - 10https://gerrit.wikimedia.org/r/233156 (owner: 10Alex Monk) [13:36:24] (03PS2) 10Yuvipanda: quarry: Remove duplication of clone_path and other variables [puppet] - 10https://gerrit.wikimedia.org/r/231759 [13:38:28] (03CR) 10Yuvipanda: [C: 032] quarry: Remove duplication of clone_path and other variables [puppet] - 10https://gerrit.wikimedia.org/r/231759 (owner: 10Yuvipanda) [13:39:51] (03PS1) 10Giuseppe Lavagetto: nutcracker: reduce logs verbosity on all the high-traffic clusters [puppet] - 10https://gerrit.wikimedia.org/r/233704 [13:40:51] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [13:47:35] (03CR) 10Yuvipanda: [C: 032] ores: Add role+class for the precached daemon [puppet] - 10https://gerrit.wikimedia.org/r/231760 (owner: 10Yuvipanda) [13:47:47] (03PS2) 10Yuvipanda: ores: Mark all roles requiring ores::base properly [puppet] - 10https://gerrit.wikimedia.org/r/231761 [13:47:58] (03CR) 10Yuvipanda: [C: 032 V: 032] ores: Mark all roles requiring ores::base properly [puppet] - 10https://gerrit.wikimedia.org/r/231761 (owner: 10Yuvipanda) [13:48:27] !log dropping old tables on s6 - T54932 [13:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:49:33] (03CR) 10Giuseppe Lavagetto: "@Merlijn:" [puppet] - 10https://gerrit.wikimedia.org/r/233156 (owner: 
10Alex Monk) [13:50:33] <_joe_> valhallasw`cloud: I think we can discuss this off-gerrit maybe :) [13:50:56] (03PS2) 10Giuseppe Lavagetto: nutcracker: reduce logs verbosity on all the high-traffic clusters [puppet] - 10https://gerrit.wikimedia.org/r/233704 [13:51:58] (03PS4) 10Muehlenhoff: Add ferm rules for Logstash/Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/227960 (https://phabricator.wikimedia.org/T104964) [13:52:21] (03CR) 10Giuseppe Lavagetto: [C: 032] nutcracker: reduce logs verbosity on all the high-traffic clusters [puppet] - 10https://gerrit.wikimedia.org/r/233704 (owner: 10Giuseppe Lavagetto) [13:52:53] (03PS5) 10Muehlenhoff: Add ferm rules for Logstash/Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/227960 (https://phabricator.wikimedia.org/T104964) [13:53:33] _joe_: > this basically blocks any other similarly trivial change one could make. <-- maybe, but then my question is 'why do these changes have to be self-merged within a few minutes'. The self-merging prevents anyone from even asking the question 'what will be the impact of this on labs hosts' [13:53:45] in this specific change, the change is also *really* hard to revert [13:53:45] <_joe_> valhallasw`cloud: sorry, 1 sec [13:53:50] <_joe_> I screwed up apparently [13:53:57] (03CR) 10Muehlenhoff: [C: 032 V: 032] "PS4 was a manual rebase due to gerrit being obtuse" [puppet] - 10https://gerrit.wikimedia.org/r/227960 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff) [13:54:04] I'll just type, feel free to respond later [13:54:31] (03PS1) 10Giuseppe Lavagetto: Revert "nutcracker: reduce logs verbosity on all the high-traffic clusters" [puppet] - 10https://gerrit.wikimedia.org/r/233711 [13:54:34] A config change (assuming the old version was puppetized) is easily reverted, but going from unpuppetized to puppetized is generally unrevertable [13:54:38] (03PS2) 10Giuseppe Lavagetto: Revert "nutcracker: reduce logs verbosity on all the high-traffic clusters" [puppet] 
- 10https://gerrit.wikimedia.org/r/233711 [13:54:47] (03CR) 10Giuseppe Lavagetto: [C: 032] Revert "nutcracker: reduce logs verbosity on all the high-traffic clusters" [puppet] - 10https://gerrit.wikimedia.org/r/233711 (owner: 10Giuseppe Lavagetto) [13:54:56] (03CR) 10Giuseppe Lavagetto: [V: 032] Revert "nutcracker: reduce logs verbosity on all the high-traffic clusters" [puppet] - 10https://gerrit.wikimedia.org/r/233711 (owner: 10Giuseppe Lavagetto) [13:55:22] <_joe_> moritzm: ok to merge your change? [13:55:49] _joe_: I just did that 10s ago [13:55:58] <_joe_> ok [13:56:18] !log dropping old tables on s7 - T5493 [13:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:56:24] <_joe_> now I have to understand why "4" is not a valid value according to our puppet repo, wtf [13:57:09] <_joe_> valhallasw`cloud: I specifically didn't discuss the self-merge, if that is your complaint, I hear you. Still I can understand why it was thought to be uncontroversial [13:57:30] PROBLEM - puppet last run on mw1174 is CRITICAL puppet fail [13:57:32] PROBLEM - puppet last run on mw1231 is CRITICAL puppet fail [13:57:35] <_joe_> (which is what usually gets to self-merges [13:57:39] 6operations: Linking a bn.wikipedia.org button to G+ page. - https://phabricator.wikimedia.org/T109810#1571038 (10Aklapper) >>! In T109810#1570888, @akosiaris wrote: > I am not sure which project to associate with this task. Seems like the big bucket operations project was (well AFAIC, so that it is not forgotte... 
[13:57:41] PROBLEM - puppet last run on mw2069 is CRITICAL puppet fail [13:57:41] <_joe_> sigh that's me ^^ [13:57:41] PROBLEM - puppet last run on mw1226 is CRITICAL puppet fail [13:57:41] PROBLEM - puppet last run on mw1215 is CRITICAL puppet fail [13:57:49] <_joe_> the fails will recover [13:57:57] <_joe_> but wtf is wrong with puppet [13:58:00] PROBLEM - puppet last run on mw2157 is CRITICAL puppet fail [13:58:01] PROBLEM - puppet last run on mw1142 is CRITICAL puppet fail [13:58:05] Yes, I trust that this change is entirely uncontroversial for prod hosts :-) [13:58:10] PROBLEM - puppet last run on mw1090 is CRITICAL puppet fail [13:58:20] PROBLEM - puppet last run on mw1220 is CRITICAL puppet fail [13:58:21] PROBLEM - puppet last run on mw1026 is CRITICAL puppet fail [13:58:31] PROBLEM - puppet last run on mw2081 is CRITICAL puppet fail [13:58:32] PROBLEM - puppet last run on mw2023 is CRITICAL puppet fail [13:58:36] PROBLEM - puppet last run on mw2021 is CRITICAL puppet fail [13:58:36] PROBLEM - puppet last run on mw2020 is CRITICAL puppet fail [13:58:41] PROBLEM - puppet last run on mw1082 is CRITICAL puppet fail [13:58:42] PROBLEM - puppet last run on mw2043 is CRITICAL puppet fail [13:58:42] PROBLEM - puppet last run on mw2036 is CRITICAL puppet fail [13:58:50] PROBLEM - puppet last run on mw1228 is CRITICAL puppet fail [13:58:51] PROBLEM - puppet last run on mw1112 is CRITICAL puppet fail [13:59:00] PROBLEM - puppet last run on mw1060 is CRITICAL puppet fail [13:59:00] PROBLEM - puppet last run on mw1139 is CRITICAL puppet fail [13:59:00] PROBLEM - puppet last run on mw1120 is CRITICAL puppet fail [13:59:01] PROBLEM - puppet last run on mw1008 is CRITICAL puppet fail [13:59:11] PROBLEM - puppet last run on mw1222 is CRITICAL puppet fail [13:59:19] <_joe_> valhallasw`cloud: uncontroversial as in "no real consequence for anyone" [13:59:22] PROBLEM - puppet last run on mw2145 is CRITICAL puppet fail [13:59:31] PROBLEM - puppet last run on mw1009 is CRITICAL 
puppet fail [13:59:41] PROBLEM - puppet last run on mw2016 is CRITICAL puppet fail [13:59:41] PROBLEM - puppet last run on mw2050 is CRITICAL puppet fail [13:59:41] PROBLEM - puppet last run on mw2073 is CRITICAL puppet fail [13:59:41] PROBLEM - puppet last run on mw1203 is CRITICAL puppet fail [13:59:41] PROBLEM - puppet last run on mw1086 is CRITICAL puppet fail [14:00:00] PROBLEM - puppet last run on mw2158 is CRITICAL puppet fail [14:00:00] PROBLEM - puppet last run on mw2126 is CRITICAL puppet fail [14:00:00] PROBLEM - puppet last run on mw2045 is CRITICAL puppet fail [14:00:00] PROBLEM - puppet last run on mw2129 is CRITICAL puppet fail [14:00:00] PROBLEM - puppet last run on mw2024 is CRITICAL puppet fail [14:00:12] PROBLEM - puppet last run on mw1061 is CRITICAL puppet fail [14:00:40] PROBLEM - puppet last run on mw1170 is CRITICAL puppet fail [14:00:41] PROBLEM - puppet last run on mw2018 is CRITICAL puppet fail [14:00:41] PROBLEM - puppet last run on mw1135 is CRITICAL puppet fail [14:00:41] PROBLEM - puppet last run on mw2207 is CRITICAL puppet fail [14:00:50] PROBLEM - puppet last run on mw1110 is CRITICAL puppet fail [14:01:36] <_joe_> valhallasw`cloud: the fact that people complain doesn't say it has a merit. But I do get why you guys feel this way (I'm in the same room as yuvi right now, so he expressed his point of view quite clearly, I did too). I am used to make choices for my "users" since I started doing this work, as a volunteer or as a paid professional, and I try very hard not to screw up anyone's workflow. Who did [14:01:42] <_joe_> get his/her workflow harmed by this change? 
[14:02:00] PROBLEM - puppet last run on mw1039 is CRITICAL puppet fail [14:02:40] PROBLEM - puppet last run on mw1119 is CRITICAL puppet fail [14:02:43] _joe_: In my opinion, the message 'ask your system administrator to install package X' provides value for tool labs users [14:02:47] <_joe_> because in the end if it's "I like it I want it", that doesn't have that much merit in general - that is until labs like prod is managed by someone and not just a public cloud [14:03:05] because they are mostly clueless when it comes to why stuff doesn't work [14:03:28] <_joe_> valhallasw`cloud: ok, maybe say that in the ticket :) This is the first valid point I hear about this [14:03:39] <_joe_> err, the change [14:03:41] the other side of the coin is that this actively broke things [14:03:47] valhallasw@tools-exec-1201:~$ wat [14:03:47] /usr/bin/python: can't find '__main__' module in '/usr/share/command-not-found' [14:04:06] but that's fixable [14:04:07] <_joe_> oh, I see, it wasn't removed from the basic bash profile [14:04:09] <_joe_> yes [14:04:29] <_joe_> so yeah we can do that better, I agree [14:05:04] <_joe_> right now I have to understand why 4 is not a valid value for ^\d$ according to puppet [14:05:06] (03PS1) 10Yuvipanda: aptly: Setup client role [puppet] - 10https://gerrit.wikimedia.org/r/233713 [14:05:31] (03PS1) 10Andrew Bogott: Wait for confirmation before deleting migrated files on the source host. 
[puppet] - 10https://gerrit.wikimedia.org/r/233714 [14:05:37] _joe_: I misread you as why "4 is not a valid value for 5" [14:05:46] To which I had a very obvious answer, until I reread [14:05:49] (03CR) 10jenkins-bot: [V: 04-1] aptly: Setup client role [puppet] - 10https://gerrit.wikimedia.org/r/233713 (owner: 10Yuvipanda) [14:05:52] * ostriches wanders off to find coffee [14:06:07] _joe_: only thing I can think of is that 4 in yaml might be a float and not an int [14:06:29] it's pretty clearly not a string in either puppet (the 5) or in yaml [14:06:32] <_joe_> valhallasw`cloud: you would like that. I think is that hiera stringifies everything it returns or something [14:06:37] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1571048 (10akosiaris) [14:06:37] (03CR) 10Andrew Bogott: [C: 032] Wait for confirmation before deleting migrated files on the source host. [puppet] - 10https://gerrit.wikimedia.org/r/233714 (owner: 10Andrew Bogott) [14:06:38] 6operations, 10Wikimedia-Mailing-lists: reinstall fermium with public IP - https://phabricator.wikimedia.org/T109890#1571046 (10akosiaris) 5Open>3Resolved And... done. Resolving. [14:06:43] <_joe_> or something [14:06:44] (03PS2) 10Yuvipanda: aptly: Setup client role [puppet] - 10https://gerrit.wikimedia.org/r/233713 [14:06:50] 6operations, 10Wikimedia-Mailing-lists: hold lists.wikimedia.org with exim - https://phabricator.wikimedia.org/T110136#1571049 (10JohnLewis) a:5JohnLewis>3Dzahn I've looked at this. If we add lists.wikimedia.org to the hold domains and increase the retry time (to at least 4 hours, the window length), then... [14:06:50] oh, right. Hiera is somewhere inbetween of course [14:06:52] <_joe_> I'm gonna nail this down locally :) [14:07:23] it does stringify I think, as I saw something in 4.0 that let's you not stringify all the things [14:07:32] akosiaris: awesome with fermium! 
:) [14:07:59] <_joe_> chasemp: what's not clear to me is why stringifying 4 makes it now pass the regexp [14:08:54] ...yeah, puppet and types are a strange cruel joke on the world [14:09:09] (03PS3) 10Yuvipanda: aptly: Setup client role [puppet] - 10https://gerrit.wikimedia.org/r/233713 [14:09:16] 6operations, 7Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1571052 (10jcrespo) I've performed a backup in iron and then deleted all `delete*old` tables from shards 1-7. Planned for tomorrow, `optin_survey_old` tables. But let's check that this has effect... [14:09:18] (03CR) 10Yuvipanda: [C: 032] aptly: Setup client role [puppet] - 10https://gerrit.wikimedia.org/r/233713 (owner: 10Yuvipanda) [14:14:58] (03PS1) 10Alex Monk: Re-add mul.wikisource.org [dns] - 10https://gerrit.wikimedia.org/r/233716 (https://phabricator.wikimedia.org/T64717) [14:15:35] <_joe_> chasemp: this is worse than I expected [14:15:42] 6operations, 10MediaWiki-Sites, 10SEO, 5MW-1.26-release, 5Patch-For-Review: URLs for the same title without extra query parameters should have the same canonical link - https://phabricator.wikimedia.org/T67402#1571061 (10Krinkle) >>! At rMW155d555b83eca6403e07d2094b074a8ed2f301ae, @Seb35 wrote: > I monit... [14:15:54] <_joe_> if you pass an integer as a class parameter, it gets casted to string when doing validate_re [14:16:05] <_joe_> if you pass it as an integer from hiera, it will fail [14:16:08] <_joe_> WAT? [14:16:30] RECOVERY - puppet last run on mw1142 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:34] that sounds like puppet [14:16:34] that literally makes no sense [14:16:38] * YuviPanda rewrites puppet in perl6 [14:17:41] that can't be [14:17:46] it can't be that bad [14:17:53] <_joe_> akosiaris: do you want proof? [14:17:57] <_joe_> I got proof [14:17:58] yes please! [14:18:06] <_joe_> even in puppet 3.7 [14:18:17] <_joe_> ok so you've seen my change fail before right? 
[14:18:19] this deserves public mockery [14:18:21] <_joe_> wait my next change [14:18:32] ? [14:19:12] well we're often blending 3 or more languages in our puppet repo that a given data item passes through, and trying to pretend that all type magic will Just Work [14:19:41] <_joe_> no here it's just something in how puppet binds ruby native integers when looking up on hiera [14:19:58] <_joe_> bblack: I tried the plain hiera yaml plugin from puppetlabs! [14:20:03] heh [14:20:15] huh. https://ask.puppetlabs.com/question/3298/how-do-i-use-validate_re-to-check-integers/ suggests validate_re should not take an integer argument even from the class definition [14:21:25] so they 'fixed' that somehow in the last two years? [14:21:45] oh, no, it's the same issue [14:22:16] so it's not hiera, it's just puppet itself [14:22:24] <_joe_> yeah but mind it - if the integer is in the class def, it will compile "correctly" [14:22:34] <_joe_> if you pass it an integer - via hiera or otherwise [14:22:34] I saw the puppet guy on stage a few years ago making fun of print not being a function in python as part of his "isn't python ridiculous" intro [14:22:37] <_joe_> it will fail [14:22:38] and I thought, glass houses amigo [14:23:12] <_joe_> valhallasw`cloud: so it's dumber than it seems [14:23:43] <_joe_> class wat ($a =4) { validate_re($a, '\d') }; include wat WORKS [14:23:54] Ops-Access-Requests, operations, Patch-For-Review: Replace SSH key for mholloway - https://phabricator.wikimedia.org/T110064#1571084 (akosiaris) so ``` select * from user where userName='Mdholloway'; ``` does return indeed a user. I am hesitant though to run the delete since no table in that database... [14:24:07] <_joe_> class wat ($a =4) { validate_re($a, '\d') }; class {wat: a => 4 } DOES NOT [14:24:14] _joe_: and if you do Integer $a=4? 
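The behaviour _joe_ describes above (a quoted "4" passes the regex check while a bare integer 4 does not) can be mimicked outside Puppet. What follows is only an analogy in Python, a minimal sketch under stated assumptions: the helper name `validate_re_like` and its `stringify` flag are hypothetical, and this is not how Puppet or its stdlib implement `validate_re`. It just shows why a regex test can behave differently for a string and an integer unless the value is converted to a string first.

```python
import re


def validate_re_like(value, pattern, stringify=False):
    """Hypothetical stand-in for a regex validator (not Puppet's real code).

    Returns True if `pattern` matches `value`. Non-string values are rejected
    unless `stringify` is set, in which case they are converted with str().
    """
    if stringify:
        value = str(value)
    if not isinstance(value, str):
        # A regex can only be applied to text; an int has nothing to match.
        return False
    return re.search(pattern, value) is not None


# Quoted value, comparable to a YAML entry written as verbosity: "4"
print(validate_re_like("4", r"\d"))
# Bare integer, comparable to a YAML entry written as verbosity: 4
print(validate_re_like(4, r"\d"))
# Integer converted to a string before the check
print(validate_re_like(4, r"\d", stringify=True))
```

In this sketch the outcome depends only on whether something stringifies the value before the regex sees it, which matches the suspicion in the channel that the manifest path and the hiera path hand over different types.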
[14:24:30] RECOVERY - puppet last run on mw1231 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [14:24:40] RECOVERY - puppet last run on mw1026 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [14:24:49] <_joe_> valhallasw`cloud: is Integer a thing in puppet? [14:24:52] <_joe_> never used it :P [14:25:02] _joe_: according to https://docs.puppetlabs.com/puppet/latest/reference/lang_classes.html#class-parameters-and-variables it is [14:25:08] that's puppet server [14:25:10] RECOVERY - puppet last run on mw1222 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [14:25:10] "Each parameter can be preceeded by an optional data type. If you include one, Puppet will check the parameter’s value at runtime to make sure that it has the right data type, and raise an error if the value is illegal." [14:25:12] <_joe_> valhallasw`cloud: in 4.x probably [14:25:12] written in clojure [14:25:22] oh [14:25:23] right [14:25:24] and having strict data types [14:25:24] yes, 4.x [14:25:30] RECOVERY - puppet last run on mw2157 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:25:30] RECOVERY - puppet last run on mw1082 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:25:31] RECOVERY - puppet last run on mw1174 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:25:40] RECOVERY - puppet last run on mw1215 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [14:25:40] RECOVERY - puppet last run on mw2043 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [14:25:40] RECOVERY - puppet last run on mw1228 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [14:25:48] <_joe_> akosiaris: no not even that [14:25:51] RECOVERY - puppet last run on mw1203 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [14:26:00] RECOVERY - puppet last run on mw1112 
is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [14:26:05] how are we going to ever migrate to that thing ... [14:26:10] RECOVERY - puppet last run on mw1226 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:26:10] RECOVERY - puppet last run on mw2069 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:26:11] RECOVERY - puppet last run on mw1120 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [14:26:11] RECOVERY - puppet last run on mw1060 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:26:18] puppet is making less and less sense to me these days [14:26:19] RECOVERY - puppet last run on mw1139 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:26:20] RECOVERY - puppet last run on mw1008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:26:30] RECOVERY - puppet last run on mw1220 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:26:31] how on earth did they manage to fail so badly [14:26:49] RECOVERY - puppet last run on mw2145 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:26:50] RECOVERY - puppet last run on mw1009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:26:51] <_joe_> akosiaris: they also removed "import" [14:26:57] it's probably this part: "Internally, Puppet treats numbers like strings until they are used in a numeric context." 
[14:26:59] <_joe_> but not how you would think they did [14:26:59] RECOVERY - puppet last run on mw2081 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:26:59] RECOVERY - puppet last run on mw1090 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:27:00] RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [14:27:00] RECOVERY - puppet last run on mw2158 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [14:27:01] RECOVERY - puppet last run on mw1135 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [14:27:10] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [14:27:10] RECOVERY - puppet last run on mw2021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:27:11] RECOVERY - puppet last run on mw2023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:27:11] RECOVERY - puppet last run on mw2020 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:27:11] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [14:27:19] RECOVERY - puppet last run on mw2207 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [14:27:20] RECOVERY - puppet last run on mw1086 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:27:25] and passing an integer to a class probably counts as 'numeric context' [14:27:30] RECOVERY - puppet last run on mw1110 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:27:38] <_joe_> valhallasw`cloud: "internally, puppet is a bunch of brogrammers who thought 'types are uncoool, duuude'" [14:27:39] RECOVERY - puppet last run on mw2016 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [14:27:39] RECOVERY - puppet 
last run on mw2073 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:27:39] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:27:40] RECOVERY - puppet last run on mw2036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:27:40] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [14:28:00] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [14:28:09] _joe_: nonono, you are misunderstanding. Puppet is not a programming language, it's a /configuration/ language. So it doesn't have to deal with fancy stuff like 'consistency'. [14:28:11] RECOVERY - puppet last run on mw2129 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:28:23] <_joe_> valhallasw`cloud: or common sense [14:28:30] RECOVERY - puppet last run on mw1119 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:28:39] (03PS1) 10Giuseppe Lavagetto: nutcracker: re-reduce verbosity [puppet] - 10https://gerrit.wikimedia.org/r/233717 [14:29:25] _joe_: Rob Browning, the Debian emacs maintainer (so a very sane person by definition) started to work at Puppet Labs recently (on the Clojure version), maybe it'll all work out [14:29:30] RECOVERY - puppet last run on mw2024 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:29:31] RECOVERY - puppet last run on mw1039 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:29:44] heh, not sure if the 'very sane person' is sarcastic or not... [14:29:46] <_joe_> moritzm: are you an emacs user? [14:30:23] _joe_: my fingers are hard-wired, can hardly use anything else [14:30:32] <_joe_> akosiaris: wanted proof? 
https://puppet-compiler.wmflabs.org/835/ [14:30:38] evil mode is the way out of that :P [14:30:56] <_joe_> nutcracker::verbosity: "4" vs nutcracker::verbosity: 4 [14:32:10] sigh... [14:32:58] (CR) Giuseppe Lavagetto: [C: 2] nutcracker: re-reduce verbosity [puppet] - https://gerrit.wikimedia.org/r/233717 (owner: Giuseppe Lavagetto) [14:33:56] * _joe_ will be the founder of /r/lolpuppet [14:35:25] akosiaris: have you heard, is the client going clojure as well? [14:35:42] <_joe_> chasemp: no they rewrote that in eiffel [14:36:49] <_joe_> (I'm sure at least a couple of you checked on google if that was actually the case) [14:37:53] <_joe_> moritzm: maybe he can fix puppet.el as well [14:38:11] <_joe_> indentation rules are not always sane, or conforming to puppetlabs specifications [14:39:25] vim has pretty good indentation, except for the fact that it doesn't recognize nested hashes [14:39:29] chasemp: I think the idea was C++ [14:39:31] and aligns all those arrows on the same lines [14:39:41] perl6 all the things! [14:39:45] but it's puppetlabs, tomorrow it might be whitespace or brainfuck [14:40:12] (I wrote a perl6 module yesterday night! https://github.com/yuvipanda/perl6-Ident-Client) [14:40:34] welp, your long term forecasting of puppet is looking more and more salient [14:41:05] lol [14:42:38] so is the new puppet language actually a programming language? 
I know they were experimenting with building manifests using Ruby at some point, but I don't think they actually did that in the end [14:43:29] well, at least there are loops [14:47:16] <_joe_> valhallasw`cloud: well I hope they implemented a map() function, mainly [14:47:27] they did IIRC [14:47:44] <_joe_> I mean they started with the correct idea (making the language declarative) and ended up fucking up all the subsequent choices [14:48:03] hehe [14:48:04] <_joe_> but hey, we're using that, so maybe it wasn't that bad [14:48:13] they added implicit ordering and some functional behavior to this declarative language [14:48:33] <_joe_> chasemp: I'm not sure I like implicit ordering [14:49:14] yeah, it's one of those "good impression for the first 10 minutes and wtf for the rest of your life" [14:49:29] <_joe_> ahah [14:49:32] (PS3) Muehlenhoff: Add ferm rules for swift proxies [puppet] - https://gerrit.wikimedia.org/r/233687 (https://phabricator.wikimedia.org/T104965) [14:49:34] I can see why they felt the need but it's a cluster mess of inconsistent things [14:49:48] I agree you want to separate resource declaration and execution, but I'm not convinced a custom language was the best choice. You could just build a declaration programmatically, with any existing language. [14:49:57] (CR) Muehlenhoff: [C: 2 V: 2] Add ferm rules for swift proxies [puppet] - https://gerrit.wikimedia.org/r/233687 (https://phabricator.wikimedia.org/T104965) (owner: Muehlenhoff) [14:50:59] (PS3) ArielGlenn: dumps: tweak stages a bit [puppet] - https://gerrit.wikimedia.org/r/233418 [14:53:35] (CR) ArielGlenn: [C: 2] dumps: tweak stages a bit [puppet] - https://gerrit.wikimedia.org/r/233418 (owner: ArielGlenn) [14:56:00] operations, Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure: dnsmasq returns SERVFAIL for (some?) 
names that do not exist instead of NXDOMAIN - https://phabricator.wikimedia.org/T92351#1571173 (10Krinkle) [14:56:37] (03CR) 10Krinkle: "No longer deployed in integration. The new dns system works fine." [puppet] - 10https://gerrit.wikimedia.org/r/196731 (https://phabricator.wikimedia.org/T92351) (owner: 10Dzahn) [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150825T1500). [15:02:26] (03CR) 10Giuseppe Lavagetto: [C: 031] "this will forbid access to port 80 and 443 of apache2, but I think that is desired as they're open by accident more than by will." [puppet] - 10https://gerrit.wikimedia.org/r/228784 (owner: 10Muehlenhoff) [15:05:25] matt_flaschen, around? [15:05:43] oh, sorry, mlitn put it up for swat [15:06:00] I’m around [15:07:30] !log krenair@tin Synchronized php-1.26wmf19/extensions/Flow: https://gerrit.wikimedia.org/r/#/c/233718/ (duration: 00m 16s) [15:07:31] mlitn, please test [15:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:08:29] PuppetSWAT in a few hours! [15:09:53] YuviPanda, shouldn't that be on the deployments calendar? [15:09:58] Krenair: it is [15:10:12] they're on next week's... [15:10:26] aaarrrghhhhh [15:10:31] I thought that was this week [15:10:54] confused what today's date was... [15:10:55] <_joe_> lol [15:10:58] let me put them in the proper place [15:11:00] <_joe_> ahahah [15:11:07] (03PS2) 10Glaisher: Re-add mul.wikisource.org [dns] - 10https://gerrit.wikimedia.org/r/233716 (https://phabricator.wikimedia.org/T64717) (owner: 10Alex Monk) [15:11:17] (03PS3) 10Glaisher: Re-add mul.wikisource.org [dns] - 10https://gerrit.wikimedia.org/r/233716 (https://phabricator.wikimedia.org/T64717) (owner: 10Alex Monk) [15:11:42] (03CR) 10Glaisher: "I don't understand why this was removed." 
[dns] - 10https://gerrit.wikimedia.org/r/233716 (https://phabricator.wikimedia.org/T64717) (owner: 10Alex Monk) [15:12:27] Krenair: they're in their correct places now [15:12:41] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 8.33% of data above the critical threshold [500.0] [15:12:55] Krenair: works fine, thanks! [15:14:00] RECOVERY - Disk space on labvirt1007 is OK: DISK OK [15:15:54] (03CR) 10Yuvipanda: [toollabs] add script to generate python package listings (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/228635 (https://phabricator.wikimedia.org/T101646) (owner: 10Merlijn van Deen) [15:15:59] valhallasw`cloud: so those are just minor nites [15:16:00] *nits [15:16:12] valhallasw`cloud: I'm quite happy to merge this, but have a question of virtualenv vs packages... [15:16:55] I just generally think we should encourage *everyone* to use virtualenvs [15:17:09] for anything that doesn't need to be compiled [15:17:31] valhallasw`cloud: do you think otherwise? if so, why? [15:20:52] (03CR) 10Alex Monk: [C: 04-1] "Does not merge" [puppet] - 10https://gerrit.wikimedia.org/r/204996 (owner: 10Legoktm) [15:22:42] (03CR) 10Alex Monk: "Hashar: Ping" [puppet] - 10https://gerrit.wikimedia.org/r/170130 (owner: 10Cscott) [15:22:49] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [15:25:01] 6operations, 10MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#1571242 (10jcrespo) The new servers are about to arrive. There are 2 options for the immediate migration, before compression: * Stop writing to es2 (blobs_cluster24) and es3 (blobs_cluster25), put... 
[15:25:36] (03CR) 10Alex Monk: "I think you need to file a ticket with security=Access request to get ops to review this" [puppet] - 10https://gerrit.wikimedia.org/r/219151 (owner: 10Aklapper) [15:25:54] heh, no patches for puppetswat [15:26:35] (03CR) 10Alex Monk: [C: 04-1] "T108474 has been closed as invalid" [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [15:27:26] YuviPanda: yes/no/maybe. I'd also like to give people a rich environment to quickly write things [15:27:56] valhallasw`cloud: hmm, so current situation is 'if you are using any of these already installed libraries, go ahead, but anything else you'd have to setup a virtualenv [15:27:56] ' [15:28:54] yes [15:29:03] !log mwscript deleteEqualMessages.php --wiki eowiki (T45917) [15:29:06] 6operations, 10hardware-requests, 7Database: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1571247 (10jcrespo) I think everything is here but the disks? It would be nice to have 2 units ASAP mounted alongside es1006 and es1009 (that will substitute), prepared to clone the exist... [15:29:08] (03CR) 10Nemo bis: "I don't understand how that's relevant. The target is valid whether the prefix is a namespace or not." [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [15:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:29:26] basically, I think the question 'is it in ubuntu' answers the question 'is it useful for broader use than a single venv' [15:29:51] but we can also go full-venv by using wheels [15:29:57] for example [15:31:43] valhallasw`cloud: hmm, but we can also do that now with --system-packages, right? [15:32:15] YuviPanda, does https://gerrit.wikimedia.org/r/#/c/232193/ sound like a good candidate for puppet swat? [15:32:18] putting together a list [15:32:39] someone working with eowiki from tin? 
[15:32:46] krinkle is [15:33:00] or rather, was [15:33:11] Krenair: if there's some easy way to verify it in a short time, then yes that can be in SWAT. From comments I'm also not sure if erik thinks it's ready for merge or not? [15:33:11] jynus: Yeah, I ran a quick maintenance script to clear some outdated mediawiki-namespace pages [15:33:17] jynus: What's up? [15:33:36] oh, sorry, it logs as error a pure info thing [15:33:48] What does? [15:33:49] "Transaction already in progress" [15:33:55] not an issue, it seems [15:33:59] YuviPanda, which comments? [15:33:59] Hm.. [15:34:13] but it shows as level"ERROR" [15:34:20] Krenair: none, basically :) It's just him uploading 4 patchsets, so I guess I'm not sure if it's ok to be merged [15:34:21] jynus: Where? [15:34:24] on the other hand, no -2 nor a WIP tag [15:34:26] so maybe it is? [15:34:41] <_joe_> Krenair: I'd like for patch authors to be present during SWAT btw :) [15:34:43] that too ^ [15:34:52] Krinkle, wfLogDBError. I monitor it from kibana [15:34:52] at least 'someone involved' [15:34:55] _joe_, okay, will just suggest they put it up themselves then [15:35:04] <_joe_> thanks :) [15:35:16] jynus: https://github.com/wikimedia/mediawiki/blob/master/maintenance/deleteEqualMessages.php [15:35:39] jynus: I'm about to run another one, let's see if it's deterministic [15:36:15] !log mwscript deleteEqualMessages.php --wiki euwiki (T45917) [15:36:18] valhallasw`cloud: anyway, that discussion is perhaps much larger :) I'm ok merging it if you respond to the two nits [15:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:36:47] valhallasw`cloud: and maybe add a README? 
[15:37:15] Krinkle, I got only one this time [15:37:16] (03PS6) 10Alex Monk: Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man) [15:38:06] (03CR) 10Alex Monk: [C: 04-1] "PS3 removed the check for the host being mira before adding base::firewall, commit message needs updating or that change to be reverted" [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man) [15:38:17] <_joe_> this ^^ is not swat material btw :) [15:38:32] <_joe_> activating the firewall on tin needs careful consideration [15:38:35] yes [15:38:42] the original commit was not supposed to add firewall to tin [15:38:55] however PS3 changed that without updating the commit message [15:39:00] Krinkle, as I said, I am not worried, just saw an spike from a non mw* host and wanted to see if there was something wrong going on [15:39:12] I left a -1 for it [15:39:19] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 7Monitoring: Replace uses of monitoring::ganglia with monitoring::graphite_* - https://phabricator.wikimedia.org/T90642#1571305 (10Ottomata) We removed a bunch of monitoring::ganglia usages as part of the Kafka upgrade and expansion. The only one that i... [15:40:08] <_joe_> ottomata: a bunch of analytics nodes have puppet disabled, I guess it's you [15:40:28] jynus: So where exactly are you looking? [15:40:50] https://logstash.wikimedia.org/#/dashboard/elasticsearch/wfLogDBError ? 
[15:41:06] Krinkle, yes [15:41:08] * Krinkle files a bug [15:41:09] thanks [15:42:28] (03CR) 10Ori.livneh: [C: 032] Initial commit of ConfigurationObserver unit tests [debs/pybal] - 10https://gerrit.wikimedia.org/r/233672 (owner: 10Ori.livneh) [15:42:38] (03PS2) 10Alex Monk: Rewrite download.wiki(p|m)edia.org urls to dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/233658 (https://phabricator.wikimedia.org/T107575) (owner: 10Chad) [15:42:47] (03Merged) 10jenkins-bot: Initial commit of ConfigurationObserver unit tests [debs/pybal] - 10https://gerrit.wikimedia.org/r/233672 (owner: 10Ori.livneh) [15:43:01] jynus: https://phabricator.wikimedia.org/T110189 [15:43:26] YuviPanda, _joe_: What about https://gerrit.wikimedia.org/r/#/c/233658/1 ? [15:43:42] It needs to land before a DNS change can take place (https://gerrit.wikimedia.org/r/#/c/233659/1 ), but shouldn't have any effect until then [15:44:05] specially if it says "implicit commit". I would be worried if it said "rolling back change" [15:44:50] !log dist-upgrade and rebooting nembus in an attempt to resolve this acpi_pad issue [15:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:45:05] <_joe_> Krenair: uhm in theory, it might be, yes [15:46:33] <_joe_> I could grasp the context in less than 5 minutes, I'd say it's a green light :) [15:47:28] Krinkle, changed the link to a permalink :) [15:47:37] Although I'm not the patch author, it seems straightforward enough for me [15:47:38] (03PS2) 10Alexandros Kosiaris: base::service_unit: ship systemd units in /lib [puppet] - 10https://gerrit.wikimedia.org/r/233626 [15:49:26] I have no idea about https://gerrit.wikimedia.org/r/#/c/229136/ [15:49:38] probably something for apergos [15:50:25] !log powercycle ms-be1004, likely xfs [15:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:51:08] and https://gerrit.wikimedia.org/r/#/c/184637/ ... maybe? 
[15:52:10] RECOVERY - Host ms-be1004 is UPING OK - Packet loss = 0%, RTA = 0.77 ms [15:52:54] (03CR) 10BryanDavis: [C: 031] Fix mwgrep to work without dynamic scripting [puppet] - 10https://gerrit.wikimedia.org/r/232193 (https://phabricator.wikimedia.org/T108151) (owner: 10EBernhardson) [15:53:29] RECOVERY - very high load average likely xfs on ms-be1004 is OK - load average: 65.21, 22.59, 8.09 [15:54:49] (03CR) 10Alex Monk: [C: 04-1] "Diffusion instead of Gitblit" [puppet] - 10https://gerrit.wikimedia.org/r/224214 (owner: 10Alex Monk) [15:55:50] RECOVERY - Host cr1-eqdfw is UPING OK - Packet loss = 0%, RTA = 53.01 ms [15:56:31] (03CR) 10Alex Monk: Redirect wiki.toolserver.org to www.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [15:57:30] PROBLEM - DPKG on nembus is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:00:04] YuviPanda _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150825T1600). Please do the needful. [16:00:31] wooo [16:01:40] Krenair: am going through your patches first [16:01:56] Is https://gerrit.wikimedia.org/r/#/c/233658/ and https://gerrit.wikimedia.org/r/#/c/233659/ reasonable? 
[16:02:01] The latter is dns, not puppet tho [16:02:19] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL host 208.80.153.198, interfaces up: 29, down: 7, dormant: 0, excluded: 1, unused: 0BRxe-1/3/0: down - DISABLEDBRfxp0: down - BRxe-0/0/0: down - Core: cr1-codfw:xe-5/0/0 CyrusOne {#?} [10Gbps DWDM]BRxe-0/0/2: down - DISABLEDBRxe-0/0/3: down - DISABLEDBRxe-0/0/1: down - DISABLEDBRxe-1/2/0: down - DISABLEDBR [16:02:24] I actually put the puppet one up on my list because it seems straightforward enough [16:03:12] Krenair: _joe_ is looking at the mw2180 one, he's taking care of it without needing to depool it [16:03:33] and shouldn't have any user-visible effect until the DNS change is applied [16:04:30] 6operations, 10ops-codfw: mw2180 has a faulty disk - https://phabricator.wikimedia.org/T109687#1571431 (10Joe) @Krenair the system was installed but puppet never ran on it, I just misunderstood what papaul said. Doing it now. [16:04:46] Krenair: Ah, just noticed. [16:04:48] Thx [16:05:09] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Remove mw2180 from mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/233638 (owner: 10Alex Monk) [16:05:32] (03CR) 10Giuseppe Lavagetto: "No need for this once puppet has finished running (right now...)" [puppet] - 10https://gerrit.wikimedia.org/r/233638 (owner: 10Alex Monk) [16:05:40] (03CR) 10Tim Landscheidt: "(Cf. Id9af35d8e33aea13d329bf5511d85ef7c578b87f for a fix for the error message.)" [puppet] - 10https://gerrit.wikimedia.org/r/233156 (owner: 10Alex Monk) [16:05:51] YuviPanda: yes, will do [16:05:53] (03Abandoned) 10Alex Monk: Remove mw2180 from mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/233638 (owner: 10Alex Monk) [16:06:56] <_joe_> Krenair: well it's like if we merged it, right? 
:) [16:07:07] Something like that [16:07:33] The problem is being resolved properly isntead [16:07:39] instead* [16:08:07] https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+-ownerin:ldap/ops,n,z is the list of things that should be looked at for future puppetswats :) [16:08:32] (03PS6) 10Yuvipanda: Fix mwgrep to work without dynamic scripting [puppet] - 10https://gerrit.wikimedia.org/r/232193 (https://phabricator.wikimedia.org/T108151) (owner: 10EBernhardson) [16:08:35] <_joe_> ostriches: niiice :) [16:08:38] I usually add -label:Code-Review<=-1 [16:08:45] (03CR) 10Yuvipanda: [C: 032 V: 032] Fix mwgrep to work without dynamic scripting [puppet] - 10https://gerrit.wikimedia.org/r/232193 (https://phabricator.wikimedia.org/T108151) (owner: 10EBernhardson) [16:08:52] Krenair: merged this one first, actually. [16:08:55] forcing a run on tin now [16:08:57] k [16:09:08] its easy :) [16:11:09] ebernhardson: Krenair try mwgrep now? [16:11:32] hmm :/ [16:11:36] urllib2.HTTPError: HTTP Error 400: Bad Request [16:11:54] !mwgrep [16:11:59] what is that mwgrep ? :-} [16:12:17] Grep for CSS or JS code in the MediaWiki NS of Wikimedia Wikis. [16:12:17] oh [16:12:24] It works on mira but not tin [16:12:29] So I guess the issue is in the mwgrep script itself [16:12:50] _joe_: even better link for "things ops should review and merge if reasonable" -- https://gerrit.wikimedia.org/r/#/q/project:operations/puppet+is:open+-label:code-review%253D-1+-label:code-review%253D-2+-label:verified%253D-1+-ownerin:ldap/ops,n,z [16:13:06] 6operations, 5 Incident-20150825-Redis, 6Discovery, 3Discovery-Cirrus-Sprint, 7Elasticsearch: Update Elasticsearch for missing updates from outage on 20150825 - https://phabricator.wikimedia.org/T110179#1571452 (10EBernhardson) There are deletes to process as well, all documented at https://wikitech.wiki... [16:13:23] <_joe_> Krenair: so you get rid of 'send-echo-emails' because it's triggered by the ones you now run on terbium? 
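The Gerrit searches linked above (open operations/puppet changes that are not owned by ops and carry no negative votes) are plain lists of search operators joined by spaces. A minimal Python sketch of assembling such a query, assuming the operators from the chat links; the function names `build_query` and `encode_query` are hypothetical, and how the encoded string is attached to a Gerrit URL depends on the Gerrit version, so that step is left out.

```python
from urllib.parse import quote_plus

# Search operators taken from the links pasted in the channel.
TERMS = [
    "project:operations/puppet",
    "is:open",
    "-label:Code-Review=-1",
    "-label:Code-Review=-2",
    "-label:Verified=-1",
    "-ownerin:ldap/ops",
]


def build_query(terms):
    """Join individual search operators into one query string."""
    return " ".join(terms)


def encode_query(query):
    """Percent-encode the query, keeping ':' and '/' readable.

    Spaces become '+' and '=' becomes '%3D'.
    """
    return quote_plus(query, safe=":/")


print(build_query(TERMS))
print(encode_query(build_query(TERMS)))
```

Keeping the terms in a list makes it easy to add or drop a filter (for example the Verified one) before a SWAT window without hand-editing an encoded URL.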
[16:13:25] Krenair: uhm, should I revert? [16:13:37] yeah [16:13:41] RECOVERY - Apache HTTP on mw2180 is OK: HTTP OK: HTTP/1.1 200 OK - 11783 bytes in 0.106 second response time [16:13:45] Krenair: actually wait [16:13:45] Sorry [16:13:47] YuviPanda, yes [16:13:50] Krenair: it needs puppet to run on all the hosts... [16:13:56] running Elastic search as well [16:13:59] hmm, i'd like to look but i'm on train and internet not playing nice :( [16:14:03] oh right, yes this will need a run on the elastic search hosts [16:14:17] Krenair: yeah, so let's wait 20mins [16:14:19] ? [16:14:22] okay [16:14:58] _joe_, yes, send-echo-emails should be removed from silver because terbium's one should cover it now [16:15:06] What's "Elastic search?" [16:15:09] <_joe_> ok, perfect [16:15:37] <_joe_> ostriches: that thingy those two guys installed years ago, it was named "elasticsearch" at the time [16:15:40] PROBLEM - puppet last run on elastic1026 is CRITICAL puppet fail [16:15:54] herp derp, what's up elastic1026? 
[16:16:01] yeah let me look [16:16:04] <_joe_> now since they did split up it's "elastic" and "search" [16:16:12] stretchy search [16:16:18] <_joe_> and then we have "discovery" [16:17:43] !log mwscript deleteEqualMessages.php --wiki frpwiki [16:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:17:52] fix coming up [16:18:10] PROBLEM - puppet last run on elastic1017 is CRITICAL puppet fail [16:19:01] PROBLEM - puppet last run on elastic1008 is CRITICAL puppet fail [16:19:11] (03PS1) 10Yuvipanda: elasticsearch: Fix dependency on /etc/elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/233737 [16:19:30] PROBLEM - puppet last run on elastic1004 is CRITICAL puppet fail [16:19:31] PROBLEM - puppet last run on elastic1027 is CRITICAL puppet fail [16:19:46] (03CR) 10Chad: [C: 031] elasticsearch: Fix dependency on /etc/elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/233737 (owner: 10Yuvipanda) [16:20:02] (03CR) 10Yuvipanda: [C: 032 V: 032] elasticsearch: Fix dependency on /etc/elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/233737 (owner: 10Yuvipanda) [16:20:04] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/836/terbium.eqiad.wmnet/ looks ok" [puppet] - 10https://gerrit.wikimedia.org/r/232866 (https://phabricator.wikimedia.org/T107547) (owner: 10Alex Monk) [16:20:07] thanks ostriches [16:20:10] yw [16:20:12] (03PS2) 10Giuseppe Lavagetto: Maintenance script maintenance for labswiki [puppet] - 10https://gerrit.wikimedia.org/r/232866 (https://phabricator.wikimedia.org/T107547) (owner: 10Alex Monk) [16:20:19] PROBLEM - puppet last run on elastic1018 is CRITICAL puppet fail [16:22:11] elasticsearch puppet failure taken care of [16:22:15] <_joe_> Krenair: merged [16:22:22] lesson learnt is to run all puppet swat patches through the compiler [16:22:28] thanks [16:22:37] <_joe_> I'm running puppet and will remove the script from silver [16:22:37] I wonder if we can make jenkins do 
that with a command [16:22:46] like, 'compile' as a comment on a change [16:22:47] <_joe_> YuviPanda: pcc is there for you [16:22:50] PROBLEM - puppet last run on elastic1016 is CRITICAL puppet fail [16:22:56] <_joe_> you don't know which hosts you want to compile from [16:23:06] compile mw1201.eqiad.wmnet no? [16:23:08] but yeah, pcc [16:23:30] PROBLEM - puppet last run on elastic1013 is CRITICAL puppet fail [16:23:40] PROBLEM - puppet last run on elastic1006 is CRITICAL puppet fail [16:23:40] RECOVERY - puppet last run on elastic1026 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:24:00] <_joe_> YuviPanda: ask hashar! [16:24:04] _joe_, you'll do it? great :) [16:24:22] <_joe_> Krenair: yep, I merged the patch as-is, I deal with cleanups [16:24:29] cool [16:24:53] 6operations: SSL certificate for policy.wikimedia.org - https://phabricator.wikimedia.org/T110197#1571514 (10Slaporte) 3NEW [16:26:31] <_joe_> Krenair: re 232871 - I'd like to see someone working on mwcore give +1 to it [16:26:42] 6operations: SSL certificate for policy.wikimedia.org - https://phabricator.wikimedia.org/T110197#1571523 (10Krenair) Who is this addressed to? Why is the current SSL certificate inadequate? Are you about to point this domain at a third party site or something? [16:26:42] <_joe_> I can say it's puppet-correct, not that it will DTRT [16:27:02] https://gerrit.wikimedia.org/r/#/c/208568/1 needs a manual rebase [16:27:14] _joe_, someone working on mw core? [16:27:15] 6operations: SSL certificate for policy.wikimedia.org - https://phabricator.wikimedia.org/T110197#1571526 (10JohnLewis) policy.wikimedia.org is behind misc-web-lb as such it doesn't have its own SSL cert but instead uses star.wikimedia.org. So this would need a certificate to be issued if needs to be off the clu... 
[16:28:11] (PS2) Yuvipanda: Remove unused group not needed anymore [puppet] - https://gerrit.wikimedia.org/r/208568 (owner: MaxSem)
[16:28:16] <_joe_> Krenair: some developer with knowledge of what the scripts do
[16:28:20] RECOVERY - HHVM processes on mw2180 is OK: PROCS OK: 6 processes with command name hhvm
[16:28:46] Well FlaggedRevs is not from core
[16:28:48] (CR) Yuvipanda: [C: 2] "The group isn't specified anywhere, and the machine it was specified in was decommed a while ago." [puppet] - https://gerrit.wikimedia.org/r/208568 (owner: MaxSem)
[16:30:18] _joe_, wikimedia-periodic-update.sh and update-special-pages are basically mwscriptwikiset with the script and dblists hard-coded
[16:30:20] operations: SSL certificate for policy.wikimedia.org - https://phabricator.wikimedia.org/T110197#1571630 (Slaporte) Hi @krenair, the site will be hosted by WordPress (similar to the Wikimedia blog). What is the process to get an appropriate certificate issued?
[16:30:35] (CR) BryanDavis: General maintenance script cleanup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/232871 (owner: Alex Monk)
[16:30:51] Other than that nit ^ it LGTM
[16:31:00] !log mwscript deleteEqualMessages.php --wiki frwiki
[16:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:31:05] <_joe_> bd808: thanks :)
[16:32:11] <_joe_> Krenair: I think bryan's comment is spot-on btw
[16:32:12] ostriches: so https://gerrit.wikimedia.org/r/#/c/233658/2 - if I go to downloads.wikimedia.org it already redirects me to dumps...
[16:32:28] https://downloads.wikimedia.org
[16:32:28] <_joe_> YuviPanda: nope, it points to dumps
[16:32:28] operations: SSL certificate for policy.wikimedia.org - https://phabricator.wikimedia.org/T110197#1571699 (Krenair) I think you'll need to request ops approve putting this on a third party site first. WMF would need to buy a new, separate, SSL certificate for policy.wikimedia.org.
[16:32:31] <_joe_> doesn't redirect
[16:32:35] operations, Traffic: SSL certificate for policy.wikimedia.org - https://phabricator.wikimedia.org/T110197#1571702 (Krenair)
[16:32:56] NXDOMAIN now?
[16:32:57] Hmm
[16:32:58] Weird.
[16:32:58] <_joe_> Krenair: if you fix that small nit, the patch is GTG
[16:33:06] Oh, download.
[16:33:08] Not downloads.
[16:33:11] great, doing
[16:33:14] thanks bd808
[16:33:15] curl -I download.wikimedia.org/hello
[16:33:20] Location: http://dumps.wikimedia.org/hello
[16:33:27] try htps.
[16:33:29] *https
[16:33:34] operations, Analytics-Kanban, Monitoring, Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1571705 (Ottomata) Natively share the dict? Hm. Just quickly tried this, and I get an immediate segfault: ``` Aug 25 16:32:25 cp1052 kernel: [8455259.595360] python[7589]: segfa...
[16:33:39] <_joe_> the problem is with the dns ofc
[16:33:57] YuviPanda: The motivation of the tasks is the mismatched certs.
[16:34:19] yeah that makes sense!
[16:34:24] I should've clicked the bug
[16:34:46] (PS3) Yuvipanda: Rewrite download.wiki(p|m)edia.org urls to dumps.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/233658 (https://phabricator.wikimedia.org/T107575) (owner: Chad)
[16:35:09] Puppet, Labs, Labs-Sprint-104, Labs-Sprint-105, Patch-For-Review: Allow per-host hiera overrides via wikitech - https://phabricator.wikimedia.org/T104202#1571721 (scfc) Open>Resolved Verified by temporarily setting `"ssh::server::explicit_macs": true` at [[https://wikitech.wikimedia.org/wiki...
[16:35:19] _joe_: https://gerrit.wikimedia.org/r/#/c/233659/ is the DNS change
[16:35:34] <_joe_> k doing both
[16:36:04] (CR) Giuseppe Lavagetto: [C: 2] Rewrite download.wiki(p|m)edia.org urls to dumps.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/233658 (https://phabricator.wikimedia.org/T107575) (owner: Chad)
[16:40:22] (CR) Alex Monk: "Did you mean to merge this?"
[software/rescue-pxe] - https://gerrit.wikimedia.org/r/178845 (https://phabricator.wikimedia.org/T78135) (owner: Filippo Giunchedi)
[16:42:06] (PS3) Alex Monk: General maintenance script cleanup [puppet] - https://gerrit.wikimedia.org/r/232871
[16:42:11] operations, Traffic: SSL certificate for policy.wikimedia.org - https://phabricator.wikimedia.org/T110197#1571869 (Slaporte) Let me know what you need. Thanks!
[16:42:49] RECOVERY - puppet last run on elastic1017 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures
[16:42:55] operations, Traffic: SSL certificate for policy.wikimedia.org - https://phabricator.wikimedia.org/T110197#1571874 (Dzahn) We have been asked to set this up on T97329 back in April. Since then we have been waiting for content to be uploaded and there were no updates. How come now there is talk about moving...
[16:42:59] RECOVERY - DPKG on nembus is OK: All packages OK
[16:44:09] RECOVERY - puppet last run on elastic1004 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures
[16:44:51] RECOVERY - puppet last run on elastic1018 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures
[16:45:11] operations, WMF-Legal: Set up new URL policy.wikimedia.org - https://phabricator.wikimedia.org/T97329#1571890 (Dzahn) No updates here but T110197 talks about buying an SSL cert and moving it to a third party. Is "It will host a static html site" not true anymore?
[16:45:49] RECOVERY - puppet last run on elastic1008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:46:09] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures
[16:46:11] RECOVERY - puppet last run on elastic1027 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures
[16:47:12] (PS4) Giuseppe Lavagetto: General maintenance script cleanup [puppet] - https://gerrit.wikimedia.org/r/232871 (owner: Alex Monk)
[16:48:00] did puppet complete on mw2180?
[16:48:11] RECOVERY - puppet last run on elastic1013 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures
[16:48:13] <_joe_> Krenair: it's running scap :)
[16:48:18] ah
[16:48:21] is nutcracker or memched broken on tin or something?
[16:48:21] RECOVERY - puppet last run on elastic1006 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures
[16:48:38] There's something funny about memcached on tin
[16:48:42] where was it...
[16:48:43] https://phabricator.wikimedia.org/T110189#1571871
[16:48:46] <_joe_> Krinkle: very possible, lemme check
[16:48:48] yep, that's the one
[16:48:52] It was working fine a few weeks ago
[16:48:56] https://phabricator.wikimedia.org/T103198
[16:48:59] Now every delete is taking like 10 seconds
[16:49:13] anything different from terbium?
[16:49:15] Krinkle: nutcracker is I think... terbium is a better place to run maintenance scripts as I recall
[16:49:17] * Krinkle resumes on terbium instead
[16:49:30] RECOVERY - puppet last run on elastic1016 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures
[16:49:32] thx
[16:49:40] <_joe_> Krinkle: I'm checking it in a few, lemme finish with the puppet-merge
[16:50:02] There was something about the deployment server role and nutcracker ... ostriches might remember what we had to do in beta cluster to work around it
[16:50:11] Oh yeah, soooo much faster
[16:50:28] bd808: Hrm?
I don't remember doing a nutcracker ballet
[16:50:44] Puppet, Continuous-Integration-Config, Patch-For-Review: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1571918 (greg) >>! In T102020#1567220, @zeljkofilipin wrote: > @hashar: any idea on which folders contain third party code? redirect this question to...
[16:51:41] RECOVERY - HHVM rendering on mw2180 is OK: HTTP OK: HTTP/1.1 200 OK - 64254 bytes in 3.820 second response time
[16:51:50] it's that ticket that Krenair dug up -- https://phabricator.wikimedia.org/T103198
[16:51:55] <_joe_> Krinkle: can you run any command that uses memcached on tin?
[16:52:39] _joe_: see ^^
[16:52:49] PROBLEM - LDAP on nembus is CRITICAL: Connection refused
[16:53:04] <_joe_> bd808: yes, looking at the logs it seems nutcracker is mostly ok
[16:53:13] <_joe_> and gets called [2015-08-25 16:53:07.146] nc_core.c:237 close s 20 '10.64.0.193:11211' on event 0000 eof 0 done 0 rb 0 sb 0: Connection timed out
[16:53:19] operations: re-seat power cord for Nembus - https://phabricator.wikimedia.org/T110202#1571923 (Andrew) NEW a:Papaul
[16:53:19] PROBLEM - LDAPS on nembus is CRITICAL: Connection refused
[16:53:20] <_joe_> but this ^^
[16:53:38] _joe_: Running now
[16:53:56] !log mwscript deleteEqualMessages.php --wiki huwiki
[16:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:54:01] (on tin)
[16:54:05] the config for nutcracker on tin is different from other MW hosts (or at least was amonth ago)
[16:54:09] <_joe_> ohhh I got what is the problem
[16:54:13] <_joe_> bd808: yeah wtf
[16:54:18] <_joe_> ok fixing it
[16:54:43] I think it was a puppet refactor that fell through the cracks probably
[16:54:50] <_joe_> yeah
[16:54:56] operations: re-seat power cord for Nembus - https://phabricator.wikimedia.org/T110202#1571932 (Andrew) btw, please ping andrewbogott on IRC before doing this so I can handle some related fallout.
[16:55:27] operations: Linking a bn.wikipedia.org button to G+ page. - https://phabricator.wikimedia.org/T109810#1571933 (Dzahn) Google Webmaster Tools things are not handled by ops anymore afaict (T83494) -> @chasemp
[16:55:47] operations, WMF-Legal: Set up new URL policy.wikimedia.org - https://phabricator.wikimedia.org/T97329#1571937 (Slaporte) @Dzahn, that's right. The site will be hosted with our blog WordPress instance.
[16:56:23] operations: migrate policy.wikimedia.org from WMF cluster to Wordpress - https://phabricator.wikimedia.org/T110203#1571938 (RobH)
[16:56:47] !log restarting pdns on labcontrol1001 and labcontrol2001 to handle a nembus reboot
[16:56:48] (CR) Yuvipanda: [C: 2] General maintenance script cleanup [puppet] - https://gerrit.wikimedia.org/r/232871 (owner: Alex Monk)
[16:56:52] Krenair: ^
[16:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:57:26] operations, WMF-Legal: Set up new URL policy.wikimedia.org - https://phabricator.wikimedia.org/T97329#1571948 (Krenair) So do they actually need one cert that covers policy AND blog, or one for each?
[16:57:28] Krenair: am forcing a run on tin
[16:57:33] operations: migrate policy.wikimedia.org from WMF cluster to Wordpress - https://phabricator.wikimedia.org/T110203#1571938 (RobH) I've added the two associated tasks as blocked by this one. T97329 - initial setup task of policy.wikimedia.org, still not resolved T110197 - request for certificate purchase (whi...
[16:57:41] operations: migrate policy.wikimedia.org from WMF cluster to Wordpress - https://phabricator.wikimedia.org/T110203#1571953 (RobH)
[16:57:43] operations, WMF-Legal: Set up new URL policy.wikimedia.org - https://phabricator.wikimedia.org/T97329#1571952 (RobH)
[16:57:45] Krenair: err, or should this be terbium?
[16:57:46] operations: migrate policy.wikimedia.org from WMF cluster to Wordpress - https://phabricator.wikimedia.org/T110203#1571938 (RobH)
[16:57:48] operations, Traffic: SSL certificate for policy.wikimedia.org - https://phabricator.wikimedia.org/T110197#1571514 (RobH)
[16:58:02] !log mwscript deleteEqualMessages.php --wiki kawiki
[16:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:58:38] operations, ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1571958 (Papaul) we did change this again here is the final connection eqdfw xe-1/0/0 cable ID = 11395 on Equinix patch panel ID= 20028800 eqdfw xe-0/0/0 cable ID = 11399 on Equinix patch panel ID = 2002879...
[16:58:39] YuviPanda, terbium, yeah
[16:58:54] yeah, am forcing a run there now
[16:59:22] operations, WMF-Legal: Set up new URL policy.wikimedia.org - https://phabricator.wikimedia.org/T97329#1571968 (RobH) stalled>Open @Slaporte: I've created T110203 to address the questions regarding the migration. This task is stalled until the questions on that task have been addressed. @Krenair: N...
[16:59:47] <_joe_> bd808, Krinkle_ I think I nailed it btw
[17:00:25] Krenair: ok, the cron has changed on terbium
[17:00:29] operations, Traffic: SSL certificate for policy.wikimedia.org - https://phabricator.wikimedia.org/T110197#1571975 (RobH) Open>stalled @Slaporte: I've created T110203 to address the questions regarding the migration. This task is stalled until the questions on that task have been addressed.
[17:00:36] Krenair: I guess we wait for the crons to run and verify? :)
[17:01:03] they don't actually write any logs
[17:01:11] ah, fun
[17:01:17] just straight to /dev/null
[17:01:19] so how do we test this?
[17:01:33] outside of 'see if someone complains'
[17:01:37] <_joe_> YuviPanda: we'll know tomorrow morning!
[17:01:46] heh
[17:01:54] could check the crontab on terbium and check the commands changed actually work as expected
[17:02:02] alternatively, make them log somewhere for a while?
[17:02:10] Krenair: hmm, the earlier one sounds ok
[17:02:21] Krenair: do you have access to do that? or should I?
[17:02:46] yes, these crons are owned by www-data so I should be able to do it
[17:03:19] ok!
[17:03:37] we've gone over time...
[17:03:40] RECOVERY - LDAPS on nembus is OK: TCP OK - 0.053 second response time on port 636
[17:04:02] operations: migrate policy.wikimedia.org from WMF cluster to Wordpress - https://phabricator.wikimedia.org/T110203#1571999 (Slaporte) > Where is the content? Will any content on our cluster need to be migrated or maintained by operations? No need to migrate any content. It's a new site under development. >...
[17:04:07] Krenair: ostriches I think we should do the DNS change in Thursday's SWAT.
[17:04:14] since we're already over time here.
[17:04:23] and the train is supposed to start now
[17:04:33] mmk
[17:05:02] we did merge the apache change tho
[17:05:05] that's still propogating as wel
[17:05:06] l
[17:05:13] fine with me
[17:05:19] RECOVERY - LDAP on nembus is OK: TCP OK - 0.053 second response time on port 389
[17:05:20] cool
[17:05:48] Krenair: ostriches ebernhardson ok, I'll call puppet swat done then :)
[17:06:39] oh actually, one of these does log
[17:06:44] * Krenair will be back soon
[17:07:12] <_joe_> YuviPanda: no it's not
[17:07:17] <_joe_> we have to merge the dns change
[17:07:37] Krenair: these are wikidata dump related messages, it's just a straightforward translation update.
https://gerrit.wikimedia.org/r/#/c/229136/ Siebrand should sign off on it, if the patch author has tested it then it can be merged after that
[17:07:38] Ops-Access-Requests, operations, Patch-For-Review: Replace SSH key for mholloway - https://phabricator.wikimedia.org/T110064#1572027 (mmodell) There is a command line tool to delete a user, rather than doing it in sql
[17:08:20] (PS1) John F. Lewis: lists: hold mail to lists.wm.o [puppet] - https://gerrit.wikimedia.org/r/233750 (https://phabricator.wikimedia.org/T110136)
[17:08:22] mutante: ^
[17:09:08] Krenair: I added hoo from wikidata, in case he has anything to say
[17:09:19] Krenair: ostriches alright, so _joe_ wants to finish up the DNS change, so let's do it.
[17:10:02] (PS2) Yuvipanda: Point download.wiki(m|p)edia.org at text-lb [dns] - https://gerrit.wikimedia.org/r/233659 (https://phabricator.wikimedia.org/T107575) (owner: Chad)
[17:10:13] (PS1) Giuseppe Lavagetto: deployment::server: re-puppetize nutcracker config [puppet] - https://gerrit.wikimedia.org/r/233751
[17:10:14] !log bouncing Cassandra on restbase1001 to apply temporary GC settings
[17:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:10:23] <_joe_> bd808, Krinkle_ ^^
[17:10:32] (CR) Yuvipanda: [C: 2] Point download.wiki(m|p)edia.org at text-lb [dns] - https://gerrit.wikimedia.org/r/233659 (https://phabricator.wikimedia.org/T107575) (owner: Chad)
[17:10:54] (CR) ArielGlenn: [C: 1] "This +1 means that just from eyeballing it, it looks fine, Siebrand should do the same before it's merged though."
[puppet] - https://gerrit.wikimedia.org/r/229136 (owner: Lokal Profil)
[17:11:28] (PS2) BryanDavis: deployment::server: re-puppetize nutcracker config [puppet] - https://gerrit.wikimedia.org/r/233751 (https://phabricator.wikimedia.org/T103198) (owner: Giuseppe Lavagetto)
[17:11:40] !log run authdns-update on radon (ns0.wikimedia.org)
[17:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:11:51] operations: migrate policy.wikimedia.org from WMF cluster to Wordpress - https://phabricator.wikimedia.org/T110203#1572039 (Dzahn) >>! In T110203#1571999, @Slaporte wrote: > No need to migrate any content. It's a new site under development. Well, we were told back in April it's gonna be static HTML content a...
[17:12:39] ostriches: Krenair done!
[17:12:49] can you put that ptach too in the deployment calendar please?
[17:12:49] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:13:00] YuviPanda, LGTM, thanks
[17:13:01] Ops-Access-Requests, operations, Patch-For-Review: Replace SSH key for mholloway - https://phabricator.wikimedia.org/T110064#1572044 (mmodell) @akosiaris: I can do the deletion if you'd prefer. But the way it's done is, on iridium, run the following command: `sudo /srv/phab/phabricator/bin/remove des...
[17:13:10] JohnFLewis: great!:)
[17:14:25] JohnFLewis: or maybe we could do "*"... hmm
[17:14:28] YuviPanda, so that was 7 patches and it went over
[17:14:33] Krenair: yeah
[17:14:36] YuviPanda, maybe limit to 6 rather than 8 next time?
[17:14:42] mutante: we could indeed
[17:14:45] Krenair: although primarily because one of them required puppet to run on all mw hosts
[17:15:00] Krenair: so we didn't parallelize enough - if we had done that first this wouldn't have run over
[17:15:03] (CR) Giuseppe Lavagetto: [C: 1] "https://puppet-compiler.wmflabs.org/837/ seems ok but it's way too late for me to merge a potentially harmful change like this." [puppet] - https://gerrit.wikimedia.org/r/233751 (https://phabricator.wikimedia.org/T103198) (owner: Giuseppe Lavagetto)
[17:15:12] YuviPanda: access requests can go in puppetSWAT?
[17:15:23] matanya: nope, don't think so.
[17:15:30] ok, thanks
[17:15:31] I would say no, but I'm not ops :)
[17:15:58] YuviPanda: even if approved? :P
[17:16:38] JohnFLewis: yes :)
[17:17:00] JohnFLewis: the on-duty person should deal with routine tasks like access requests and what not
[17:19:27] Ops-Access-Requests, operations, Patch-For-Review: Replace SSH key for mholloway - https://phabricator.wikimedia.org/T110064#1572264 (chasemp) >>! In T110064#1570612, @akosiaris wrote: > @chasemp, @mmodell, any pointers on how to resolve the above ? Thanks! >>! In T110064#1572044, @mmodell wrote: >...
[17:20:21] it's the "what not" part
[17:20:39] (PS1) Faidon Liambotis: Readd cr1-eqdfw to smokeping [puppet] - https://gerrit.wikimedia.org/r/233753
[17:20:40] operations, Traffic: Requests from a specific network are blocked - https://phabricator.wikimedia.org/T110208#1572269 (Krenair)
[17:21:06] operations, Traffic, network: Requests from a specific network are blocked - https://phabricator.wikimedia.org/T110208#1572275 (Krenair)
[17:21:11] (CR) Faidon Liambotis: [C: 2] Readd cr1-eqdfw to smokeping [puppet] - https://gerrit.wikimedia.org/r/233753 (owner: Faidon Liambotis)
[17:22:37] operations: migrate policy.wikimedia.org from WMF cluster to Wordpress - https://phabricator.wikimedia.org/T110203#1572285 (Slaporte) >>!
In T110203#1572039, @Dzahn wrote: >>>! In T110203#1571999, @Slaporte wrote: >> No need to migrate any content. It's a new site under development. > > Well, we were told ba...
[17:23:53] I get this from hive when running some query: Job Submission failed with exception 'org.apache.hadoop.security.AccessControlException(Permission denied: user=smalyshev, access=WRITE, inode="/user":hdfs:hadoop:drwxrwxr-x
[17:24:00] anybody knows what's missing?
[17:24:27] permissions? :)
[17:24:42] ori: I guess :) which ones?
[17:25:30] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0]
[17:29:32] Ops-Access-Requests, operations: Need access for smalyshev to hive queries on stat1002 - https://phabricator.wikimedia.org/T110217#1572377 (Smalyshev) NEW
[17:30:02] Ops-Access-Requests, operations: Need access for smalyshev to hive queries on stat1002 - https://phabricator.wikimedia.org/T110217#1572388 (Smalyshev)
[17:31:13] (CR) CSteipp: [C: 1] Revert "Revert of Iab860b8a5: Make puppet cronjob to run SecurePoll/cli/purgePrivateVoteData.php" [puppet] - https://gerrit.wikimedia.org/r/184637 (owner: Anomie)
[17:31:24] mobrovac: hey! is spec.js in mobileapps repo something you invented, based on some other implementation, or generated by swagger?
[17:32:00] SMalyshev: sorry, dunno
[17:32:10] operations: Linking a bn.wikipedia.org button to G+ page. - https://phabricator.wikimedia.org/T109810#1572401 (chasemp) >>! In T109810#1571933, @Dzahn wrote: > Google Webmaster Tools things are not handled by ops anymore afaict (T83494) -> @chasemp I think the current arrangement is closer to Discovery being...
[17:32:22] ori: ok, np, I submitted the request
[17:33:34] (CR) Dzahn: "it was added before in https://gerrit.wikimedia.org/r/#/c/173247/ why was it removed?"
[dns] - https://gerrit.wikimedia.org/r/233716 (https://phabricator.wikimedia.org/T64717) (owner: Alex Monk)
[17:36:01] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[17:45:21] (CR) Jforrester: "What do Reading and Discovery think of this change? Potentially could impact things for which they're responsible." [mediawiki-config] - https://gerrit.wikimedia.org/r/233664 (https://phabricator.wikimedia.org/T37052) (owner: MZMcBride)
[17:54:14] (CR) Dzahn: "was removed in https://gerrit.wikimedia.org/r/#/c/223051/2 because it looked like it wasn't used, no traffic" [dns] - https://gerrit.wikimedia.org/r/233716 (https://phabricator.wikimedia.org/T64717) (owner: Alex Monk)
[17:55:51] (CR) Dzahn: [C: 1] Remove auto-redirection from 404 page. [mediawiki-config] - https://gerrit.wikimedia.org/r/233664 (https://phabricator.wikimedia.org/T37052) (owner: MZMcBride)
[17:56:25] (PS4) Dzahn: Re-add mul.wikisource.org [dns] - https://gerrit.wikimedia.org/r/233716 (https://phabricator.wikimedia.org/T64717) (owner: Alex Monk)
[17:56:42] (CR) Dzahn: [C: 2] Re-add mul.wikisource.org [dns] - https://gerrit.wikimedia.org/r/233716 (https://phabricator.wikimedia.org/T64717) (owner: Alex Monk)
[18:00:04] twentyafterfour greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150825T1800). Please do the needful.
[18:03:38] (PS2) MaxSem: Maps: Add geo-index to the water_polygons table [puppet] - https://gerrit.wikimedia.org/r/232728 (https://phabricator.wikimedia.org/T109710) (owner: Yurik)
[18:04:07] akosiaris, jynus - could you take a look ^^^ ?
[18:06:07] (CR) Yurik: [C: 1] Maps: Add geo-index to the water_polygons table [puppet] - https://gerrit.wikimedia.org/r/232728 (https://phabricator.wikimedia.org/T109710) (owner: Yurik)
[18:06:47] (CR) Dzahn: [C: -1] "bd808 says that sql/sqldump uses wikiadmin_pass and the sql script is used by developers a lot from tin/terbium" [puppet] - https://gerrit.wikimedia.org/r/232903 (owner: Faidon Liambotis)
[18:10:52] (CR) Dzahn: "> Don't know whether to (ab)use the sql_user and sql_pass of the existing communitymetrics script." [puppet] - https://gerrit.wikimedia.org/r/233219 (https://phabricator.wikimedia.org/T85183) (owner: Aklapper)
[18:12:28] (CR) Dzahn: [C: 1] "this looks good to me, it's like what Ori said when we talked about it yesterday." [puppet] - https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: Dzahn)
[18:14:07] (CR) Alex Monk: "PS3 should handle that already Dzahn..." [puppet] - https://gerrit.wikimedia.org/r/232903 (owner: Faidon Liambotis)
[18:15:44] operations: Linking a bn.wikipedia.org button to G+ page. - https://phabricator.wikimedia.org/T109810#1572602 (Jalexander) Yeah I'm not sure we have a formal set of "whose responsible for what" right now with Google Webmaster... Personally I'm happy to help confirm the connection but let me check with the law...
[18:23:49] (PS1) 20after4: symlinks for 1.26wmf20 [mediawiki-config] - https://gerrit.wikimedia.org/r/233765
[18:24:44] (CR) Dzahn: [C: 1] Remove/ensure=> absent *_pass scripts [puppet] - https://gerrit.wikimedia.org/r/232903 (owner: Faidon Liambotis)
[18:27:16] (CR) Dzahn: [C: -1] "what about role::releases::upload" [puppet] - https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: Hoo man)
[18:27:26] (CR) 20after4: [C: 2] symlinks for 1.26wmf20 [mediawiki-config] - https://gerrit.wikimedia.org/r/233765 (owner: 20after4)
[18:27:34] (Merged) jenkins-bot: symlinks for 1.26wmf20 [mediawiki-config] - https://gerrit.wikimedia.org/r/233765 (owner: 20after4)
[18:31:33] !log twentyafterfour@tin Started scap: testwiki to 1.26wmf20
[18:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:32:11] (CR) Alex Monk: "It was removed from tin in Ica21c2a4, but also added to mira in Iee1a4447" [puppet] - https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: Hoo man)
[18:32:59] operations, Wikimedia-Mailing-lists: write migration plan for mailman - https://phabricator.wikimedia.org/T109467#1572664 (JohnLewis)
[18:33:14] operations, Wikimedia-Mailing-lists: write migration plan for mailman - https://phabricator.wikimedia.org/T109467#1549626 (JohnLewis) Added the table to the description.
[18:34:30] JohnFLewis, is that plan not completely written then?
[18:35:06] Krenair: I just asked mutante what is left. it seems complete to me when I looked it over during and after writing so
[18:35:11] if it's completed, move the whole thing to the blocked ticket and close this one?
[18:35:44] yeah that will be the plan afaics
[18:36:16] (CR) Mark Bergsma: "I'd like to store a full backup of that dir somewhere."
[puppet] - https://gerrit.wikimedia.org/r/231142 (owner: Faidon Liambotis)
[18:43:34] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[18:45:51] (CR) Dzahn: "i have sent a mail about this to the engineering list" [puppet] - https://gerrit.wikimedia.org/r/231142 (owner: Faidon Liambotis)
[18:47:03] (CR) Dzahn: "the purpose of making this was to get the discussion going. that was successful. but multiple -1, so i think i'll abandon it and we contin" [dns] - https://gerrit.wikimedia.org/r/232669 (https://phabricator.wikimedia.org/T99216) (owner: Dzahn)
[18:47:10] (Abandoned) Dzahn: add CNAME videoserver.wm.org -> archive.org [dns] - https://gerrit.wikimedia.org/r/232669 (https://phabricator.wikimedia.org/T99216) (owner: Dzahn)
[18:50:23] operations, ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1572697 (RobH) I need to confirm with Chicago/Arul the PDUs are in place.
[19:00:10] (Abandoned) Dzahn: wmnet: fix indentations for readability [dns] - https://gerrit.wikimedia.org/r/231020 (owner: Dzahn)
[19:02:10] (PS7) Dzahn: ferm rules for nutcracker [puppet] - https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970)
[19:02:28] (CR) Dzahn: [C: 2] ferm rules for nutcracker [puppet] - https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: Dzahn)
[19:05:17] operations, ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1572769 (RobH)
[19:21:42] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[19:21:45] !log twentyafterfour@tin Finished scap: testwiki to 1.26wmf20 (duration: 50m 12s)
[19:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:22:16] (PS1) 20after4: group0 wikis to 1.26wmf20 [mediawiki-config] - https://gerrit.wikimedia.org/r/233773
[19:22:33] (CR) 20after4: [C: 2] group0 wikis to 1.26wmf20 [mediawiki-config] - https://gerrit.wikimedia.org/r/233773 (owner: 20after4)
[19:22:38] (Merged) jenkins-bot: group0 wikis to 1.26wmf20 [mediawiki-config] - https://gerrit.wikimedia.org/r/233773 (owner: 20after4)
[19:24:05] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.26wmf20
[19:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:51:09] (CR) MaxSem: [C: 1] "My home dir is OK to nuke." [puppet] - https://gerrit.wikimedia.org/r/231142 (owner: Faidon Liambotis)
[19:51:28] (PS1) Awight: Sort log streams alphabetically [mediawiki-config] - https://gerrit.wikimedia.org/r/233781
[19:56:50] Blocked-on-Operations, Flow, Collaboration-Team-Current, Patch-For-Review, and 2 others: Separate reference tables by wiki - https://phabricator.wikimedia.org/T107204#1573004 (Mattflaschen) @Etonkovidova This is a multi-phase task. You can do regression testing now and look at the DB, but it will o...
[19:59:59] (PS1) Dzahn: wmnet: fix some indentations and missing "IN"s [dns] - https://gerrit.wikimedia.org/r/233831
[20:00:04] tgr: Respected human, time to deploy OAuth (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150825T2000). Please do the needful.
[20:00:52] doing a parsoid deploy to in a few mins to update our express library.
[20:01:07] (CR) Ori.livneh: "Already merged, but just chiming in to say this LGTM as well."
[puppet] - https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: Dzahn)
[20:02:00] (CR) Dzahn: "thanks Ori, i was about to ask you that :)" [puppet] - https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: Dzahn)
[20:02:37] (CR) Dzahn: [C: 2] wmnet: fix some indentations and missing "IN"s [dns] - https://gerrit.wikimedia.org/r/233831 (owner: Dzahn)
[20:09:19] (PS9) Merlijn van Deen: [toollabs] add script to generate python package listings [puppet] - https://gerrit.wikimedia.org/r/228635 (https://phabricator.wikimedia.org/T101646)
[20:10:31] (PS6) Dzahn: kibana: make compatible with Apache 2.4 [puppet] - https://gerrit.wikimedia.org/r/230692
[20:10:36] (CR) jenkins-bot: [V: -1] [toollabs] add script to generate python package listings [puppet] - https://gerrit.wikimedia.org/r/228635 (https://phabricator.wikimedia.org/T101646) (owner: Merlijn van Deen)
[20:10:39] YuviPanda: https://git.wikimedia.org/raw/operations%2Fpuppet.git/f076bacbcda87130560ada2ad62dd2b9592dcc9f/modules%2Ftoollabs%2Fmanifests%2Fgenpp%2Freport-python.html gah :<
[20:10:45] why doesn't that just show the html :P
[20:11:16] (PS2) Rush: phab: reduce log history [puppet] - https://gerrit.wikimedia.org/r/230661
[20:12:24] (CR) Dzahn: [C: 2] kibana: make compatible with Apache 2.4 [puppet] - https://gerrit.wikimedia.org/r/230692 (owner: Dzahn)
[20:12:47] aaand I'll name my jinja template something not .pp
[20:12:52] valhallasw`cloud: I thikn the server sends text/plain mimetype to avoid random files from repos being rendered or attempt to render
[20:12:54] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[20:13:06] Yeah, it's probably a good idea from a security perspective [20:13:21] !log deployed parsoid version 759916fc [20:13:23] (03PS3) 10Rush: phab: reduce log history [puppet] - 10https://gerrit.wikimedia.org/r/230661 [20:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:13:58] bd808: what i don't really get is this: [20:14:04] 50-logstash1001.conf: ServerName logstash.wmflabs.org [20:14:14] can't find the server at logstash.wmflabs.org. [20:14:24] (03CR) 10Rush: [C: 032 V: 032] phab: reduce log history [puppet] - 10https://gerrit.wikimedia.org/r/230661 (owner: 10Rush) [20:14:58] subbu: are you working in /srv/wikimedia-staging? [20:15:15] um, mediawiki-staging [20:15:26] (03PS4) 10Rush: diamond: nutcracker collector improvements [puppet] - 10https://gerrit.wikimedia.org/r/230259 [20:15:29] (03PS10) 10Merlijn van Deen: [toollabs] add script to generate python package listings [puppet] - 10https://gerrit.wikimedia.org/r/228635 (https://phabricator.wikimedia.org/T101646) [20:15:49] tgr, no [20:15:53] i am also done with deploy. [20:16:00] cool, thanks [20:16:02] so, all yours. 
[20:16:29] (03CR) 10jenkins-bot: [V: 04-1] [toollabs] add script to generate python package listings [puppet] - 10https://gerrit.wikimedia.org/r/228635 (https://phabricator.wikimedia.org/T101646) (owner: 10Merlijn van Deen) [20:20:40] (03CR) 10Gergő Tisza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233608 (https://phabricator.wikimedia.org/T108648) (owner: 10Gergő Tisza) [20:20:54] (03PS2) 10Gergő Tisza: Set OAuth readonly for DB migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233608 (https://phabricator.wikimedia.org/T108648) [20:21:06] (03PS2) 10Gergő Tisza: Change OAuth central wiki to metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233609 (https://phabricator.wikimedia.org/T108648) [20:21:11] (03PS2) 10Gergő Tisza: End OAuth migration; reenable writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233610 (https://phabricator.wikimedia.org/T108648) [20:22:09] (03CR) 10Gergő Tisza: [C: 032] Set OAuth readonly for DB migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233608 (https://phabricator.wikimedia.org/T108648) (owner: 10Gergő Tisza) [20:22:15] (03Merged) 10jenkins-bot: Set OAuth readonly for DB migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233608 (https://phabricator.wikimedia.org/T108648) (owner: 10Gergő Tisza) [20:23:31] (03PS11) 10Merlijn van Deen: [toollabs] add script to generate python package listings [puppet] - 10https://gerrit.wikimedia.org/r/228635 (https://phabricator.wikimedia.org/T101646) [20:24:39] !log tgr@tin Synchronized wmf-config/CommonSettings.php: set OAuth to readonly for DB migration T108648 (duration: 00m 13s) [20:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:26:17] 6operations, 6Discovery, 7Elasticsearch: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236#1573096 (10chasemp) 3NEW [20:29:51] bblack, hi, when would be a good time to chat re maps? 
i think mark was trying to set it up, but not sure what's the status of that [20:30:54] i'm not trying to set it up [20:30:56] (03CR) 10Gergő Tisza: [C: 032] Change OAuth central wiki to metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233609 (https://phabricator.wikimedia.org/T108648) (owner: 10Gergő Tisza) [20:31:02] (03Merged) 10jenkins-bot: Change OAuth central wiki to metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233609 (https://phabricator.wikimedia.org/T108648) (owner: 10Gergő Tisza) [20:31:03] coordinate with alex, and brandon if he wants to [20:31:09] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1573109 (10Papaul) Below are the cable ID's for cr1-codfw and cr2-codfw 11402 cr1-codfw xe-5/0/0 11403 cr2-codfw xe-5/0/0 will update the final diagram later [20:32:17] hello! Does anyone have a moment to add my phab account (also ejegg) to triagers? Our bulk-edit-enabled members are both going to burning man [20:32:31] !log tgr@tin Synchronized wmf-config/CommonSettings.php: change wgMWOAuthCentralWiki mediawikiwiki -> metawiki T108648 (duration: 00m 12s) [20:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:33:26] (03CR) 10Gergő Tisza: [C: 032] End OAuth migration; reenable writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233610 (https://phabricator.wikimedia.org/T108648) (owner: 10Gergő Tisza) [20:33:32] (03Merged) 10jenkins-bot: End OAuth migration; reenable writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233610 (https://phabricator.wikimedia.org/T108648) (owner: 10Gergő Tisza) [20:34:11] 6operations, 6WMF-Legal: Set up new URL policy.wikimedia.org - https://phabricator.wikimedia.org/T97329#1573134 (10RobH) 5Open>3Resolved I'm resolving this task, since its no longer going to get setup on our cluster and instead migrated to wordpress. As such, the setup task (this one) for our side is inva... 
[20:34:42] !log tgr@tin Synchronized wmf-config/CommonSettings.php: make OAuth DB writable again T108648 (duration: 00m 12s) [20:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:36:09] ejegg: sure I can help you out, whats your phab username [20:36:24] chasemp: Thanks! phab username is also ejegg [20:36:33] tricky [20:36:37] heh [20:36:57] akosiaris, bblack, i will write an email to everyone involved (including greg-g), and we could discuss it there - should be faster to get started. And ideally we could talk tomorrow early (to allow akosiaris to participate) [20:37:05] just meaningless enough to be available nearly everywhere [20:37:11] mutante: where's that from? The logstash service in labs is at https://logstash-beta.wmflabs.org/ (fronted by the Labs proxy service) [20:37:11] ejegg: done [20:37:25] thanks again chasemp! [20:38:43] 6operations: migrate policy.wikimedia.org from WMF cluster to Wordpress - https://phabricator.wikimedia.org/T110203#1573141 (10RobH) @Slaporte has emailed me out of band for the discussion of the actual USD cost of the SSL certificate T110197. So we know the following: * no data to migrate * legal will cover th... [20:41:20] bd808: logstash1001 in prod [20:41:59] mutante: heh. bad configuration I put into puppet somewhere then. 
I'll see if I can find it [20:42:22] the prod instances are fronted by the misc varnish cluster [20:44:01] (03PS1) 10RobH: setting policy.wikimedia.org to a 5m ttl [dns] - 10https://gerrit.wikimedia.org/r/233837 [20:44:02] 6operations, 6Discovery, 7Elasticsearch: Investigate tweaking of the "wait for me" parameter for upgrades / restarts - https://phabricator.wikimedia.org/T109091#1573147 (10chasemp) https://www.elastic.co/guide/en/elasticsearch/reference/1.7/delayed-allocation.html [20:45:07] (03PS1) 10BryanDavis: Kibana: Fix vhost name in hiera [puppet] - 10https://gerrit.wikimedia.org/r/233838 [20:45:14] mutante: ^ [20:45:31] 6operations: migrate policy.wikimedia.org from WMF cluster to Wordpress - https://phabricator.wikimedia.org/T110203#1573159 (10RobH) Additionally, I've changed the TTL on the DNS entry for policy.wikimedia.org from 1 hour to 5 minutes. This will take up to an hour to propagate across the web, but once that is d... [20:45:49] (03CR) 10RobH: [C: 032] setting policy.wikimedia.org to a 5m ttl [dns] - 10https://gerrit.wikimedia.org/r/233837 (owner: 10RobH) [20:45:50] bd808: :) thx [20:48:57] (03CR) 10BryanDavis: [C: 031] "I've wanted to do this a zillion times but I never think about it except when I'm making another config change here." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233781 (owner: 10Awight) [20:54:19] !log finished OAuth migration [20:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:56:44] tgr: awesome! 
[20:58:47] tgr: yay :D [21:04:22] (03CR) 10Ori.livneh: [C: 031] diamond: nutcracker collector improvements [puppet] - 10https://gerrit.wikimedia.org/r/230259 (owner: 10Rush) [21:09:52] !log ori@tin Synchronized php-1.26wmf20/extensions/AbuseFilter: I15f5b5b6 & I9c23b607 (duration: 00m 14s) [21:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:13:39] !log ori@tin Synchronized php-1.26wmf19/extensions/Cite/modules/ext.cite.styles.css: 7344e02216: Updated mediawiki/core Project: mediawiki/extensions/Cite (duration: 00m 12s) [21:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:16:50] !log ori@tin Synchronized php-1.26wmf19/extensions/AbuseFilter: I15f5b5b6 & I9c23b607 (duration: 00m 13s) [21:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:17:24] (03PS1) 10Thcipriani: Create ssh-agent-proxy internal permissions [puppet] - 10https://gerrit.wikimedia.org/r/233850 [21:23:33] YuviPanda: this looks fun: https://github.com/etsy/morgue [21:23:49] it's etsy's postmortem tool [21:24:41] (03PS4) 10Ori.livneh: Revert "Revert of Iab860b8a5: Make puppet cronjob to run SecurePoll/cli/purgePrivateVoteData.php" [puppet] - 10https://gerrit.wikimedia.org/r/184637 (owner: 10Anomie) [21:24:43] (03PS5) 10Faidon Liambotis: (WIP) Remove support for Ubuntu Lucid/10.04 [puppet] - 10https://gerrit.wikimedia.org/r/179888 [21:24:58] (03PS6) 10Faidon Liambotis: Remove support for Ubuntu Lucid/10.04 [puppet] - 10https://gerrit.wikimedia.org/r/179888 [21:25:00] paravoid: let's not be hasty [21:25:09] ori: ? 
[21:25:14] just trolling :P [21:25:31] (03CR) 10Faidon Liambotis: [C: 04-2] "(not until sodium is gone)" [puppet] - 10https://gerrit.wikimedia.org/r/179888 (owner: 10Faidon Liambotis) [21:26:15] (03CR) 10Ori.livneh: [C: 032] Revert "Revert of Iab860b8a5: Make puppet cronjob to run SecurePoll/cli/purgePrivateVoteData.php" [puppet] - 10https://gerrit.wikimedia.org/r/184637 (owner: 10Anomie) [21:29:43] PROBLEM - puppet last run on nembus is CRITICAL Puppet last ran 6 hours ago [21:29:48] (03PS1) 10Faidon Liambotis: Replace Package['git-core'] with Package['git'] [puppet] - 10https://gerrit.wikimedia.org/r/233853 [21:31:32] 7Blocked-on-Operations, 6operations, 5Patch-For-Review: Install nodejs, nginx and other dependencies on francium - https://phabricator.wikimedia.org/T94457#1573332 (10Nitingupta910) are there any updates on making HTML dumps generally available? [21:32:26] (03CR) 10Merlijn van Deen: [C: 04-1] "As far as I can see, there's also four labs hosts still on lucid:" [puppet] - 10https://gerrit.wikimedia.org/r/179888 (owner: 10Faidon Liambotis) [21:32:31] (03PS1) 10Andrew Bogott: Add labnet1002 hiera host file [puppet] - 10https://gerrit.wikimedia.org/r/233854 [21:32:33] (03PS1) 10Andrew Bogott: Switch labs controller to openstack juno [puppet] - 10https://gerrit.wikimedia.org/r/233855 [21:32:35] (03PS1) 10Andrew Bogott: Move labnet1001 and 1002 to openstack Juno [puppet] - 10https://gerrit.wikimedia.org/r/233856 [21:32:37] (03PS1) 10Andrew Bogott: Move labvirt1005 to Juno [puppet] - 10https://gerrit.wikimedia.org/r/233857 [21:32:40] (03PS1) 10Andrew Bogott: Move holmium/designate to openstack Juno [puppet] - 10https://gerrit.wikimedia.org/r/233858 [21:32:41] (03PS1) 10Andrew Bogott: Move californium/Horizon to openstack Juno [puppet] - 10https://gerrit.wikimedia.org/r/233859 [21:32:47] (03CR) 10Faidon Liambotis: [C: 04-2] "not until sodium is gone" [puppet] - 10https://gerrit.wikimedia.org/r/233853 (owner: 10Faidon Liambotis) [21:32:53] (03PS1) 10Gergő 
Tisza: Exclude Wiki Education Foundation dashboard IP from rate limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233860 (https://phabricator.wikimedia.org/T110235) [21:36:13] (03CR) 10Yuvipanda: "Our labs puppet code hasn't supported lucid for a long time - they are unsshable for quite a while now (years?)." [puppet] - 10https://gerrit.wikimedia.org/r/179888 (owner: 10Faidon Liambotis) [21:37:49] (03PS5) 10Andrew Bogott: Add labs config files for Openstack version Juno [puppet] - 10https://gerrit.wikimedia.org/r/192483 [21:38:25] (03CR) 10Merlijn van Deen: Remove support for Ubuntu Lucid/10.04 [puppet] - 10https://gerrit.wikimedia.org/r/179888 (owner: 10Faidon Liambotis) [21:39:15] (03PS2) 10Gergő Tisza: Exclude Wiki Education Foundation dashboard IP from rate limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233860 (https://phabricator.wikimedia.org/T110235) [22:01:42] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 13 data above and 0 below the confidence bounds [22:01:50] greg-g: parsoid needs to deploy a cherry-pick to make the RESTBase people happy. [22:02:05] greg-g: according to the deploy schedule, tgr is finishing up the oauth deploy right now? [22:03:12] cscott: greg-g is chatting IRL [22:03:31] ohai [22:03:35] is tgr nearby IRL? [22:03:35] cscott: go ahead [22:03:37] YuviPanda: poke ? [22:03:44] just trying to make sure i'm not stepping on oauth toes. [22:03:55] tgr should be done, he !log that the oauth transition is complete [22:04:08] sorry, robla and I were chatting [22:04:10] :) [22:05:34] PROBLEM - puppet last run on neon is CRITICAL Puppet has 1 failures [22:05:57] (03PS1) 10BryanDavis: Logstash: make sure all input defines deal with ferm [puppet] - 10https://gerrit.wikimedia.org/r/233866 [22:10:47] cscott: yeah, oauth is done [22:13:49] tgr: thanks. 
[22:13:57] (03CR) 10BBlack: [C: 04-1] base::service_unit: ship systemd units in /lib (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/233626 (owner: 10Alexandros Kosiaris) [22:14:37] (03PS3) 10BBlack: Disable IPSec monitoring temporarily [puppet] - 10https://gerrit.wikimedia.org/r/233616 (https://phabricator.wikimedia.org/T110065) [22:15:04] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [22:15:33] (03CR) 10BBlack: [C: 032] Disable IPSec monitoring temporarily [puppet] - 10https://gerrit.wikimedia.org/r/233616 (https://phabricator.wikimedia.org/T110065) (owner: 10BBlack) [22:20:16] !log updated Parsoid to version c3b037b0 [22:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:22:39] (03CR) 10Faidon Liambotis: [C: 031] Switch codfw to tier2 [puppet] - 10https://gerrit.wikimedia.org/r/233438 (https://phabricator.wikimedia.org/T110065) (owner: 10BBlack) [22:25:56] legoktm, shall we just do https://gerrit.wikimedia.org/r/#/c/233665/ ? [22:26:18] I suppose so [22:26:34] Krenair: maybe we should wait for Mxn since he's the current active maintainer of the portals [22:26:53] to put on syntax highlighting? [22:28:02] you never know what breaks people's workflows ;P [22:28:10] I added him to the reviewer list [22:28:54] obligatory reference: https://xkcd.com/1172/ [22:29:30] (03CR) 1020after4: [C: 031] Create ssh-agent-proxy internal permissions [puppet] - 10https://gerrit.wikimedia.org/r/233850 (owner: 10Thcipriani) [22:29:40] exactly :) [22:29:53] :) [22:32:07] RECOVERY - puppet last run on neon is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [22:39:01] legoktm, Platonides: Ugh. https://phabricator.wikimedia.org/T110248 is asking to just rename a namespace [22:39:03] as in, the simple way [22:39:33] and have a bot go around and update 2000 links [22:40:05] instead of creating the new namespace, moving the pages into it, and removing the old [22:50:53] Krenair: eh? 
just alias the old name? [22:51:44] uh, right [22:52:01] that does seem like the proper way to do it [22:52:06] MatmaRex, good idea :p [22:52:30] at your service [22:56:27] PROBLEM - OCG health on ocg1001 is CRITICAL ocg_job_status 280817 msg: ocg_render_job_queue 3010 msg (=3000 critical) [22:56:38] PROBLEM - OCG health on ocg1003 is CRITICAL ocg_job_status 281202 msg: ocg_render_job_queue 3231 msg (=3000 critical) [22:56:57] PROBLEM - OCG health on ocg1002 is CRITICAL ocg_job_status 281531 msg: ocg_render_job_queue 3429 msg (=3000 critical) [22:57:11] (03PS1) 10Alex Monk: Rename Facoltà namespace to Area on itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233872 (https://phabricator.wikimedia.org/T110248) [22:57:46] (03CR) 10Ori.livneh: "@Thcipriani: did you see my comments on the previous change-set?" [puppet] - 10https://gerrit.wikimedia.org/r/233850 (owner: 10Thcipriani) [22:58:48] (03PS6) 10Thcipriani: Add servicedeploy user [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) [22:58:50] (03PS2) 10Thcipriani: Create ssh-agent-proxy internal permissions [puppet] - 10https://gerrit.wikimedia.org/r/233850 [22:59:27] ragesoss, what sort of rate limiting is in place on wikiedu's side? [22:59:28] (03CR) 10Ori.livneh: "Yes, yes you did :)" [puppet] - 10https://gerrit.wikimedia.org/r/233850 (owner: 10Thcipriani) [22:59:58] Do you really only need it on the English Wikipedia? [23:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150825T2300). Please do the needful. [23:00:04] bd808 tgr legoktm: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:24] matt_flaschen: Didn't you have a patch you wanted to SWAT? 
[23:00:47] (03PS4) 10BBlack: Switch codfw to tier2 [puppet] - 10https://gerrit.wikimedia.org/r/233438 (https://phabricator.wikimedia.org/T110065) [23:00:52] Krenair: there's no rate limiting on wikiedu's side right now. And currently, our system only edits on English Wikipedia. That's likely to remain the case for a while. [23:00:53] RoanKattouw, yeah, your reply fix. I'll add it, thanks. [23:00:58] o/ [23:02:47] o/ [23:02:59] (03CR) 10Ori.livneh: [C: 031] Create ssh-agent-proxy internal permissions (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/233850 (owner: 10Thcipriani) [23:03:01] (03PS3) 10Thcipriani: Create ssh-agent-proxy internal permissions [puppet] - 10https://gerrit.wikimedia.org/r/233850 [23:03:07] you've reviewed this all already bd808? [23:03:34] Krenair: the logging change? yeah. It looks to just be a sort of the existing code [23:03:47] thcipriani: I'd be happy to merge that once you tell me it's ready. [23:04:00] I was just too lazy to merge and sync midday [23:04:16] Added the Flow one. [23:04:22] (03CR) 10BBlack: [C: 032] Switch codfw to tier2 [puppet] - 10https://gerrit.wikimedia.org/r/233438 (https://phabricator.wikimedia.org/T110065) (owner: 10BBlack) [23:04:26] (03CR) 10Alex Monk: [C: 032] Sort log streams alphabetically [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233781 (owner: 10Awight) [23:04:28] (03PS1) 10Dzahn: mailman: apply list role on fermium [puppet] - 10https://gerrit.wikimedia.org/r/233873 [23:04:56] (03Merged) 10jenkins-bot: Sort log streams alphabetically [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233781 (owner: 10Awight) [23:04:58] ori: kk, I missed a couple of weird things that I just pushed a patch for, reviewing your last comments and then it should be ready to go after I push those changes. 
[23:05:04] (03PS2) 10Dzahn: mailman: apply list role on fermium [puppet] - 10https://gerrit.wikimedia.org/r/233873 (https://phabricator.wikimedia.org/T109925) [23:05:06] bd808, great [23:05:51] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/233781/ (duration: 00m 12s) [23:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:06:00] (03CR) 10CSteipp: "Assuming that IP is entirely managed by wikied (rdns is dashboard.wikiedu.org, so seems likely), and they're taking reasonable precautions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233860 (https://phabricator.wikimedia.org/T110235) (owner: 10Gergő Tisza) [23:06:18] (03CR) 10John F. Lewis: [C: 031] "Progress!" [puppet] - 10https://gerrit.wikimedia.org/r/233873 (https://phabricator.wikimedia.org/T109925) (owner: 10Dzahn) [23:07:00] Krenair: https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki looks good, no spikes or dips [23:07:01] thcipriani: if you're amending, I'd suggest moving lines 172-176 out of global scope and into a function (not a method of SshAgentProxyHandler, just a free-standing function) [23:07:08] PROBLEM - puppet last run on cp2001 is CRITICAL Puppet has 1 failures [23:07:28] PROBLEM - puppet last run on cp1065 is CRITICAL puppet fail [23:07:37] (03CR) 10Alex Monk: [C: 032] Use wfLoadExtension() directly for loading some extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232962 (owner: 10Legoktm) [23:07:42] thanks [23:07:50] ori: yeah, I can do that while I'm in here. 
[23:08:04] (03Merged) 10jenkins-bot: Use wfLoadExtension() directly for loading some extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232962 (owner: 10Legoktm) [23:09:34] (03PS1) 10BBlack: Bugfix for fff9aca5: s/concat/array_concat/ [puppet] - 10https://gerrit.wikimedia.org/r/233874 [23:10:02] (03CR) 10BBlack: [C: 032 V: 032] Bugfix for fff9aca5: s/concat/array_concat/ [puppet] - 10https://gerrit.wikimedia.org/r/233874 (owner: 10BBlack) [23:10:02] !log krenair@tin Synchronized wmf-config/extension-list: https://gerrit.wikimedia.org/r/#/c/232962/ (duration: 00m 12s) [23:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:29] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/232962/ (duration: 00m 12s) [23:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:11:18] RECOVERY - puppet last run on cp1065 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [23:12:07] (03CR) 10Alex Monk: [C: 032] Use wfLoadSkin(s) to load all skins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232963 (owner: 10Legoktm) [23:12:13] (03Merged) 10jenkins-bot: Use wfLoadSkin(s) to load all skins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232963 (owner: 10Legoktm) [23:12:46] !log krenair@tin Synchronized wmf-config/extension-list: https://gerrit.wikimedia.org/r/#/c/232963/ (duration: 00m 12s) [23:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:12:59] RECOVERY - puppet last run on cp2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:13:16] \o/ [23:13:17] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/232963/ (duration: 00m 12s) [23:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:13:47] bblack, can you review the cookie change at 
https://gerrit.wikimedia.org/r/#/c/230924/ ? I'm pretty sure it doesn't have any of the special strings, but just double-checking (which I think is the procedure). [23:14:23] Krenair: all skins still work :D [23:14:27] yay [23:14:34] 7Blocked-on-Operations, 10Flow, 3Collaboration-Team-Current, 5Patch-For-Review, 7WorkTypeNewFunctionality: Opt-in: Guided tour on user talk for first visit to new Flow board - https://phabricator.wikimedia.org/T108266#1573666 (10Mattflaschen) Can ops double-check the cookie name (https://gerrit.wikimedia... [23:14:59] (03CR) 10Alex Monk: [C: 032] Rename Facoltà namespace to Area on itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233872 (https://phabricator.wikimedia.org/T110248) (owner: 10Alex Monk) [23:15:25] (03Merged) 10jenkins-bot: Rename Facoltà namespace to Area on itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233872 (https://phabricator.wikimedia.org/T110248) (owner: 10Alex Monk) [23:15:59] matt_flaschen: just to make sure I'm understanding the patch right, the only cookie is the exact name "Flow_optIn_guidedTour"? [23:16:03] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/233872/ (duration: 00m 13s) [23:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:16:16] bblack, yes. No other new cookies. [23:16:53] It also reuses the standard GuidedTour cookie, but that's been around a long time. [23:16:59] ok [23:18:34] (03CR) 10Alex Monk: [C: 04-1] "I asked what sort of rate limiting was in place on wikiedu's side, but apparently there isn't any." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/233860 (https://phabricator.wikimedia.org/T110235) (owner: 10Gergő Tisza) [23:19:48] PROBLEM - puppet last run on cp2011 is CRITICAL Puppet has 1 failures [23:20:27] PROBLEM - puppet last run on cp2003 is CRITICAL Puppet has 1 failures [23:20:57] PROBLEM - puppet last run on cp2016 is CRITICAL Puppet has 1 failures [23:21:34] (03PS4) 10Thcipriani: Create ssh-agent-proxy internal permissions [puppet] - 10https://gerrit.wikimedia.org/r/233850 [23:21:38] PROBLEM - puppet last run on cp2010 is CRITICAL Puppet has 1 failures [23:22:25] ^ ignore those cp20xx failures [23:22:26] Krenair: the patch would give the wikiedu dashboard the same permissions any user gets four days after registration [23:23:09] doesn't that criteria also include having to make 10 edits? [23:23:57] PROBLEM - puppet last run on cp2019 is CRITICAL Puppet has 1 failures [23:23:58] it does; the point is, it doesn't give any permissions that an attacker could not get with trivial effort [23:25:04] csteipp, is that okay with you? [23:25:49] PROBLEM - puppet last run on cp2015 is CRITICAL Puppet has 1 failures [23:26:00] Krenair: From what I know of the situation, yeah, I'm ok with it. [23:27:47] csteipp, do you agree that this should be moved to using TrustedXFF? [23:28:24] (when possible. not right now) [23:29:05] (03PS5) 10Thcipriani: Create ssh-agent-proxy internal permissions [puppet] - 10https://gerrit.wikimedia.org/r/233850 [23:29:18] ori: the ssh-agent-proxy patch should be ready now. ^ [23:30:18] PROBLEM - puppet last run on cp2002 is CRITICAL Puppet has 1 failures [23:30:22] Krenair: Since it's doing some bot-like work, I'm not sure about that. Aiui, at least part of it will be making updates when there's no active user on the dashboard. So it's less proxy and more bot. 
[23:30:27] 6operations, 10Wikimedia-Mailing-lists: write migration plan for mailman - https://phabricator.wikimedia.org/T109467#1573697 (10Dzahn) items from meeting: "double check for mailman cronjobs" "how long does rsync take" (for the final run, both) "where does mailman store listinfo info" "also tell ops list" "sto... [23:30:47] 7Blocked-on-Operations, 10Flow, 3Collaboration-Team-Current, 5Patch-For-Review, 7WorkTypeNewFunctionality: Opt-in: Guided tour on user talk for first visit to new Flow board - https://phabricator.wikimedia.org/T108266#1573701 (10Mattflaschen) Cookie name reviewed. [23:30:53] thcipriani: lgtm. have you tested it in beta? [23:31:04] csteipp, wait, so it edits on behalf of users like a bot? [23:31:08] or it just runs separate bots? [23:31:20] ori: just staging, lemme pull it over to beta [23:31:28] (03PS5) 10Rush: diamond: nutcracker collector improvements [puppet] - 10https://gerrit.wikimedia.org/r/230259 [23:31:55] PROBLEM - puppet last run on cp2013 is CRITICAL Puppet has 1 failures [23:32:02] ragesoss, ^ [23:32:15] 6operations, 10Wikimedia-Mailing-lists: export config and archive data from sodium - https://phabricator.wikimedia.org/T108071#1573705 (10Dzahn) running rsync of all archives to new fermium, with --dry-run [23:32:23] Krenair: Just edits kindof like a bot [23:32:27] Not running a bot [23:33:19] (03CR) 10Rush: [C: 032] diamond: nutcracker collector improvements [puppet] - 10https://gerrit.wikimedia.org/r/230259 (owner: 10Rush) [23:33:32] Krenair: https://phabricator.wikimedia.org/T110235#1573327 [23:33:50] sigh. 
if I could get to deployment-puppetmaster [23:34:08] I agree that the non-bot-like part should eventually use XFF [23:34:18] not sure about scheduled actions [23:34:52] at any rate, if it causes serious problems, it can be blocked easily since it uses a single IP [23:35:11] I assume the instructor will be authorising the site to edit under their own account [23:35:25] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [23:35:55] yes, but if you enable autoblocks that will still block the whole dashboard for a day [23:36:13] yeah [23:36:14] PROBLEM - puppet last run on cp2004 is CRITICAL Puppet has 1 failures [23:36:18] which is enough time to find an oauthadmin and revoke the app permissions [23:36:30] so IMO this is low-risk [23:36:35] RECOVERY - puppet last run on cp2003 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [23:36:36] RECOVERY - puppet last run on cp2002 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [23:36:36] RECOVERY - puppet last run on cp2011 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [23:36:44] Let's do this then, perhaps we should get an email sent to checkuser-l? [23:36:45] and in the long term we should look for a better solution [23:36:51] greg-g, we need to organize a talk about making maps a "tier 2" service (per a long discussion with Mark and others), but you have a big offsite for the rest of the week, and i will be gone the next week. I'm inviting akosiaris & bblack to it, and a few more people. Would you mind not to be there and talk over email? [23:36:55] RECOVERY - puppet last run on cp2010 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [23:37:18] or does it already show that wikiedu is responsible for these edits? 
[23:37:32] (03CR) 10Alex Monk: [C: 032] "per discussion in -operations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233860 (https://phabricator.wikimedia.org/T110235) (owner: 10Gergő Tisza) [23:37:38] greg-g, i will email what we would like to achieve and will welcome any thoughts and suggestions [23:37:41] enwiki has some system message about special IPs, I'll make a request there [23:37:44] RECOVERY - puppet last run on cp2013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:37:45] RECOVERY - puppet last run on cp2019 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:37:55] RECOVERY - puppet last run on cp2016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:37:56] (03Merged) 10jenkins-bot: Exclude Wiki Education Foundation dashboard IP from rate limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233860 (https://phabricator.wikimedia.org/T110235) (owner: 10Gergő Tisza) [23:37:58] I can also write to checkuser-l if you think that's useful [23:38:09] actually as this is an enwiki-only thing, probably not. [23:38:14] RECOVERY - puppet last run on cp2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:38:15] RECOVERY - puppet last run on cp2015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:39:02] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/233860/ (duration: 00m 12s) [23:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:39:08] tgr, ragesoss: ^ [23:39:23] yurik: if brandon and alex are there, I'm fine with a summary of what was discussed/decisions made [23:39:36] greg-g, sounds good, thx [23:39:39] and yeah, wee two day day-long offsites! [23:39:41] :) [23:39:53] Krenair: awesome, thanks! [23:40:10] thcipriani: why cna't you? 
[23:40:31] can't [23:40:32] Krenair: csteipp: tgr: happy to work towards a better long-term solution on the wikiedu end. [23:40:35] PROBLEM - puppet last run on cp2023 is CRITICAL Puppet has 1 failures [23:40:54] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [23:41:06] PROBLEM - puppet last run on cp2008 is CRITICAL Puppet has 1 failures [23:41:16] ori: looks like it was a casualty of the labvirt1007 reboot maybe. Not sure why there haven't been alerts: no box can reach it. Trying a reboot. [23:41:22] 6operations, 10Wikimedia-Mailing-lists: write migration plan for mailman - https://phabricator.wikimedia.org/T109467#1573728 (10JohnLewis) qfiles; this can be handled two ways. We could stop mailman with exim and rsync them or we can hold exim and let mailman run for a defined period of time (10 minutes to be... [23:41:25] * ori nods [23:41:36] PROBLEM - puppet last run on cp2021 is CRITICAL Puppet has 1 failures [23:42:00] (03PS1) 10BBlack: Revert "Disable IPSec monitoring temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/233879 [23:42:07] (03PS2) 10BBlack: Revert "Disable IPSec monitoring temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/233879 [23:42:16] (03CR) 10BBlack: [C: 032 V: 032] Revert "Disable IPSec monitoring temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/233879 (owner: 10BBlack) [23:43:35] RECOVERY - puppet last run on cp2021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:44:35] RECOVERY - puppet last run on cp2023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:45:28] ori: well, shoot. Let's postpone this. 
Looks like deployment-puppetmaster may have some disk corruption :( [23:46:35] RECOVERY - OCG health on ocg1001 is OK ocg_job_status 313213 msg: ocg_render_job_queue 384 msg [23:47:45] RECOVERY - OCG health on ocg1003 is OK ocg_job_status 313408 msg: ocg_render_job_queue 0 msg [23:47:55] RECOVERY - OCG health on ocg1002 is OK ocg_job_status 313451 msg: ocg_render_job_queue 0 msg [23:48:09] 6operations, 7Monitoring: Collect and report nutcracker statistics to Ganglia and/or Graphite - https://phabricator.wikimedia.org/T107381#1573729 (10chasemp) a:5chasemp>3ori We now have some solid statistics in graphite. I'm going to vote we don't duplicate them in ganglia. I"m not sure what @ori thinks... [23:51:05] RECOVERY - puppet last run on cp2008 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [23:51:49] ragesoss: https://en.wikipedia.org/wiki/MediaWiki_talk:Blockiptext#WikiEdu_dashboard [23:51:59] in case you want to drop a contact address [23:54:01] cscott, shall we do https://gerrit.wikimedia.org/r/#/c/233439/ now ? [23:54:02] tgr: cool. left a comment on-wiki so people know where to follow up. [23:54:08] (03CR) 10Alex Monk: [C: 031] Always use VRS to configure Visual Editor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233439 (owner: 10Cscott) [23:59:47] cscott, oh and https://gerrit.wikimedia.org/r/#/c/200038/ [23:59:54] (03CR) 10Alex Monk: "Bump." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200038 (owner: 10Cscott)