[01:00:29] 06Operations, 13Patch-For-Review: install font packages on all appservers, not just imagescalers (was: Install fonts-wqy-zenhei on all mediawiki app servers) - https://phabricator.wikimedia.org/T84777#2432252 (10Cwek) >>! In T84777#2362923, @Dzahn wrote: > @Cwek The thing is that currently all the fonts are ju... [01:09:25] (03PS3) 10Dzahn: WIP: Gerrit: Setup rsync between old and new machines [puppet] - 10https://gerrit.wikimedia.org/r/296957 (https://phabricator.wikimedia.org/T125018) (owner: 10Chad) [01:21:50] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1301.49 seconds [01:24:43] 06Operations: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#1766457 (10AlexMonk-WMF) promethium already got allocated elsewhere (to the wikitextexp project), right? Or was that for this purpose? [01:26:47] (03PS4) 10Dzahn: Gerrit: Setup rsync between old and new machines [puppet] - 10https://gerrit.wikimedia.org/r/296957 (https://phabricator.wikimedia.org/T125018) (owner: 10Chad) [01:28:03] (03PS5) 10Dzahn: Gerrit: Setup rsync between old and new machines [puppet] - 10https://gerrit.wikimedia.org/r/296957 (https://phabricator.wikimedia.org/T125018) (owner: 10Chad) [01:34:40] (03PS6) 10Dzahn: Gerrit: Setup rsync between old and new machines [puppet] - 10https://gerrit.wikimedia.org/r/296957 (https://phabricator.wikimedia.org/T125018) (owner: 10Chad) [01:35:46] (03CR) 10jenkins-bot: [V: 04-1] Gerrit: Setup rsync between old and new machines [puppet] - 10https://gerrit.wikimedia.org/r/296957 (https://phabricator.wikimedia.org/T125018) (owner: 10Chad) [01:41:01] (03PS7) 10Dzahn: Gerrit: Setup rsync between old and new machines [puppet] - 10https://gerrit.wikimedia.org/r/296957 (https://phabricator.wikimedia.org/T125018) (owner: 10Chad) [01:41:59] (03PS8) 10Dzahn: Gerrit: Setup rsync between old and new machines [puppet] - 10https://gerrit.wikimedia.org/r/296957 
(https://phabricator.wikimedia.org/T125018) (owner: 10Chad) [01:42:22] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:42:57] 06Operations, 13Patch-For-Review: install font packages on all appservers, not just imagescalers (was: Install fonts-wqy-zenhei on all mediawiki app servers) - https://phabricator.wikimedia.org/T84777#2432316 (10Dzahn) @Muehlenhoff Do you know? [01:44:41] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [01:53:51] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.15 seconds [02:25:51] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.8) (duration: 09m 16s) [02:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:02:31] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.9) (duration: 17m 48s) [03:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:09:17] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Jul 6 03:09:17 UTC 2016 (duration 6m 53s) [03:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:57:49] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: puppet fail [04:24:38] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [04:28:58] PROBLEM - Hadoop DataNode on analytics1034 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [04:31:18] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr1-eqiad:xe-4/2/0 (Telia, IC-307235, 34ms) {#10693} [10Gbps wave]BR [04:31:27] RECOVERY - Hadoop DataNode on analytics1034 is OK: PROCS OK: 1 process with 
command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [04:33:58] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [06:31:22] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:44] The hadoop datanode issue is a Java OOM that we are investigating, probably I'll need to increase the JVM heap size today [06:32:12] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:21] PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:32] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:42] PROBLEM - puppet last run on mw1090 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:42] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:41] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:54:36] (03PS2) 10Muehlenhoff: role::kafka::main::broker: Use DOMAIN_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/297409 [06:55:52] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:56:32] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:56:42] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:56:52] RECOVERY - puppet last run on db1045 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:57:03] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:57:12] RECOVERY - puppet last run on mw1090 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:11] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 
minute ago with 0 failures [07:10:22] 06Operations, 10DBA: Drop phlegal_* databases from m3 - https://phabricator.wikimedia.org/T112573#2432494 (10jcrespo) 05Open>03Resolved All phlegal* databases were dropped from m3 and archived temporarily on es2002. [07:31:11] (03PS1) 10Jcrespo: Fail commons db servers back to its original configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297551 (https://phabricator.wikimedia.org/T139346) [07:51:17] !log restarted hhvm on mw1148 [07:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:51:33] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 3.587 second response time [07:51:42] RECOVERY - HHVM rendering on mw1148 is OK: HTTP OK: HTTP/1.1 200 OK - 67362 bytes in 7.231 second response time [07:54:35] (03PS2) 10Jcrespo: Fail commons db servers back to its original configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297551 (https://phabricator.wikimedia.org/T139346) [07:55:47] (03PS3) 10Jcrespo: Fail commons db servers back to its original configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297551 (https://phabricator.wikimedia.org/T139346) [07:56:49] (03CR) 10Jcrespo: [C: 032] Fail commons db servers back to its original configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297551 (https://phabricator.wikimedia.org/T139346) (owner: 10Jcrespo) [07:57:34] !log rolling reboot of sca clusters for kernel update [07:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:58:35] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Fail commons db servers back to its original configuration (duration: 00m 45s) [07:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:58:52] 06Operations, 13Patch-For-Review: Automated service restarts for common low-level system services - 
https://phabricator.wikimedia.org/T135991#2432534 (10MoritzMuehlenhoff) p:05Triage>03Normal [07:59:55] (03PS6) 10Elukey: Include a cassandra::instance::monitoring class [puppet] - 10https://gerrit.wikimedia.org/r/295125 (https://phabricator.wikimedia.org/T137422) (owner: 10Nicko) [08:00:20] 06Operations, 10Traffic, 10fundraising-tech-ops: Fix nits in Fundraising HTTPS/HSTS configs in wikimedia.org domain - https://phabricator.wikimedia.org/T137161#2432535 (10MoritzMuehlenhoff) p:05Triage>03Normal [08:00:43] 06Operations, 06Services, 13Patch-For-Review, 15User-mobrovac: Updates various services to nodejs 4.4.6 - https://phabricator.wikimedia.org/T138561#2432536 (10mobrovac) >>! In T138561#2430064, @Mholloway wrote: > As it happens, I've been running 4.4.6 on my dev machine for a while now and so far as I've se... [08:01:11] godog, mobrovac, gehel: fyi merging https://gerrit.wikimedia.org/r/295125 [08:01:36] 06Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 13Patch-For-Review: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2432537 (10Tgr) Five users are stuck: ``` mysql:wikiadmin@db1079 [centralauth]> select ru_oldname, ru_newname, count(*) from renameuser_status... [08:01:42] elukey: Thanks! [08:02:13] (03CR) 10Elukey: [C: 032] Include a cassandra::instance::monitoring class [puppet] - 10https://gerrit.wikimedia.org/r/295125 (https://phabricator.wikimedia.org/T137422) (owner: 10Nicko) [08:03:34] elukey: hm, why do both role::aqs and role::cassandra have the same monitoring declaration when aqs includes cassandra? 
[08:03:53] RECOVERY - puppet last run on mw1148 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [08:05:12] mobrovac: o/ aqs includes the cassandra module that does not have anymore monitoring, Nicko moved it to a monitoring class [08:05:24] not sure if I am missing something [08:05:34] (03PS2) 10Urbanecm: Enable autopatrolled user group at urwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297283 (https://phabricator.wikimedia.org/T139302) [08:05:34] oh ok [08:07:24] so basically each role now needs to add monitoring, it won't be enabled by default.. in this way everyone could configure the monitoring class as needed (for example, me and gehel are eager to add notifications to analytics and discovery as opposed to only ops and team-services) [08:09:07] 06Operations, 10Analytics-Cluster, 10Packaging: libcglib3-java replaces libcglib-java in Jessie - https://phabricator.wikimedia.org/T137791#2379016 (10MoritzMuehlenhoff) It's a bit strange, the source package has also changed and it appears there's two versions of that source in Debian stretch by now: https:... [08:11:28] !log stopping replication on db1056 and performing alter table [08:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:12:54] (03PS3) 10Urbanecm: Change Albanian Wikiquote logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297195 (https://phabricator.wikimedia.org/T139229) [08:13:49] (03PS2) 10Urbanecm: HD version for sqwikiquote's logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297196 (https://phabricator.wikimedia.org/T139229) [08:19:21] elukey: ok, so no changes are needed on our side to keep the monitoring for rb cass? [08:27:35] 06Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 13Patch-For-Review: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2432583 (10Dereckson) The script uses `users_to_rename` table, according 42c451c commit message. 
It's a table used during SUL migration, see T7... [08:27:38] mobrovac: yep it should be all good, it was a no op.. I checked the alarms and they are all present, will re-check later on.. [08:27:50] kk thnx elukey! [08:31:54] 06Operations, 10MediaWiki-Configuration, 10MediaWiki-Database, 07Performance, and 2 others: Set mediawiki's read-only mode automatically when database masters are detected to have read_only=1 - https://phabricator.wikimedia.org/T135711#2432587 (10jcrespo) [08:32:37] !log powercycle ms-be2021, unreachable and nothing on console [08:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:33:38] (03Abandoned) 10Hashar: Revert "nodepool: lower # of instances" [puppet] - 10https://gerrit.wikimedia.org/r/297512 (https://phabricator.wikimedia.org/T139285) (owner: 10Hashar) [08:35:31] (03PS2) 10Gehel: Correct hiera property name to create postgresql users [puppet] - 10https://gerrit.wikimedia.org/r/297273 (https://phabricator.wikimedia.org/T138092) [08:36:23] RECOVERY - Host ms-be2021 is UP: PING OK - Packet loss = 0%, RTA = 36.89 ms [08:36:32] RECOVERY - MD RAID on ms-be2021 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [08:36:59] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, and 4 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#2432595 (10Tau) How do I check the http log channel? In /mediawiki/LocalSettings.php I have following lines: ``` ## To enable image uploa... 
[08:37:18] (03CR) 10Gehel: [C: 032] Correct hiera property name to create postgresql users [puppet] - 10https://gerrit.wikimedia.org/r/297273 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [08:38:22] RECOVERY - HP RAID on ms-be2021 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [08:46:37] !log lithium:~$ sudo lvextend --size +50G -t -r /dev/mapper/lithium--vg-syslog [08:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:48:23] RECOVERY - Disk space on lithium is OK: DISK OK [08:49:26] (03PS1) 10Thiemo Mättig (WMDE): Add Cape Verdean Creole (kea) as extra language for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297556 (https://phabricator.wikimedia.org/T127435) [08:53:38] (03CR) 10Gehel: "Tested and working on maps-scratch5. Need more review to see what changes are expected on other postgresql nodes now that users will be cr" [puppet] - 10https://gerrit.wikimedia.org/r/296551 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [08:56:28] (03CR) 10Hashar: "Hiera.debug() outputs to stdout which mangle the rspec output. The Hiera proxy backends have a bunch of Hiera.debug() calls. 
To reproduce:" [puppet] - 10https://gerrit.wikimedia.org/r/297133 (owner: 10Hashar) [08:58:24] (03CR) 10Elukey: "From the following link it seems that namenode_opts will take precedence over hadoop ones, but I am not sure about "HADOOP_HEAPSIZE sets t" [puppet] - 10https://gerrit.wikimedia.org/r/296899 (https://phabricator.wikimedia.org/T139071) (owner: 10Elukey) [09:03:32] (03PS1) 10Giuseppe Lavagetto: Add tox tests [software/service-checker] - 10https://gerrit.wikimedia.org/r/297557 [09:03:34] (03PS1) 10Giuseppe Lavagetto: Debianization [software/service-checker] - 10https://gerrit.wikimedia.org/r/297558 [09:05:54] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to deployment hosts (tin/terbium) for Brian Wolff - https://phabricator.wikimedia.org/T138635#2405970 (10MoritzMuehlenhoff) FYI, we didn't have an ops meeting on Monday, this will be discussed on next weeks's meeting. [09:06:07] (03PS2) 10Gehel: Maps - notify tilerator of new expire files [puppet] - 10https://gerrit.wikimedia.org/r/297471 [09:06:16] (03CR) 10Alexandros Kosiaris: "actually, is this required at all ? I mean the entire promethium stanza. The vlan assigned to the host (labs-instances1-b-eqiad) means DHC" [puppet] - 10https://gerrit.wikimedia.org/r/297534 (https://phabricator.wikimedia.org/T120262) (owner: 10Dzahn) [09:06:55] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 6 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2432613 (10Storkk) I don't think it's caused by similarity of names... The followin... [09:07:13] PROBLEM - HP RAID on ms-be2026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[09:08:23] 06Operations, 10Ops-Access-Requests, 06Discovery, 10Wikidata, and 2 others: Enable WDQS admins to enable/disable mask/unmask updater service - https://phabricator.wikimedia.org/T138627#2432614 (10Gehel) 05Open>03Resolved a:03Gehel No objections were raised, this is merged. [09:09:11] !log pooling into service the last batch of new API appservers - mw1284->mw1290 [09:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:09:23] RECOVERY - HP RAID on ms-be2026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [09:09:58] (03CR) 10Hashar: "recheck" [software/service-checker] - 10https://gerrit.wikimedia.org/r/297557 (owner: 10Giuseppe Lavagetto) [09:10:23] !log elukey@palladium conftool action : set/pooled=yes:weight=25; selector: mw1284.eqiad.wmnet [09:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:10:48] 06Operations, 10Analytics, 10Analytics-Cluster, 10EventBus, 06Services: Better monitoring for Zookeeper - https://phabricator.wikimedia.org/T137302#2432626 (10MoritzMuehlenhoff) p:05Triage>03Normal [09:10:52] sorry for the spam that I'll create but I prefer to add one server at the time :) [09:12:48] (03PS1) 10Jcrespo: Update s4 partitioning in preparation for db1056 pooling [software] - 10https://gerrit.wikimedia.org/r/297560 [09:12:59] !log elukey@palladium conftool action : set/pooled=yes:weight=25; selector: mw1285.eqiad.wmnet [09:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:13:24] <_joe_> elukey: use --quiet when you already logged the collective action [09:14:41] _joe_ ah ok sorry didn't know! thanks! 
[09:14:57] (03PS3) 10Gehel: Maps - notify tilerator of new expire files [puppet] - 10https://gerrit.wikimedia.org/r/297471 [09:15:04] <_joe_> no reason to be sorry [09:15:15] <_joe_> but well, I'm happy you're sorry, now you owe me one [09:15:55] oh noeesss :D [09:19:00] I know that I am probably asking a very naive question but is there any sort of "acceptable" threshold for errors logged by hhvm? I can see tons of spam in https://logstash.wikimedia.org/#/dashboard/elasticsearch/hhvm [09:19:32] <_joe_> elukey: what do you mean? [09:19:55] <_joe_> elukey: so most errors come from 3 hosts [09:19:56] for example, "Notice: Undefined index: 0 in /srv/mediawiki/php-1.28.0-wmf.9/extensions/Translate/tag/TranslatablePage.php" is constantly spamming and it might hide real issues [09:20:03] <_joe_> we might want to scap pull there [09:20:07] <_joe_> good catch [09:20:16] <_joe_> (another thing we should really monitor, btw) [09:20:17] elukey: the log spam is manually tracked, partly by Releng [09:20:22] and we fill bugs against https://phabricator.wikimedia.org/tag/wikimedia-log-errors/ [09:20:33] hashar: thanks! [09:21:12] then releng / opener try to figure out what is the extension emitting the log, and loop in the developers / team [09:21:33] every tuesday when we deploy the train, there is a surge of new errors coming from the new branche [09:21:36] really interested, will try to take a look to the log-errors tag [09:21:59] and sometime we will mark the log spam as a blocker to continue the train on wednesday (which is most wikis except wikipedias) [09:24:06] elukey: and that one is definitely worth a bug / blocking [09:24:27] <_joe_> ok scap pull solved the issue on mw1256 AFAICT [09:24:42] elukey: I am filling the bug [09:24:46] _joe_ scap pull can be run without a depoo/pool right? 
[09:24:47] 06Operations, 10Monitoring: Monitor the BMC's event log for hardware errors - https://phabricator.wikimedia.org/T136311#2432633 (10MoritzMuehlenhoff) p:05Triage>03Normal [09:24:50] <_joe_> uhm or not [09:24:54] *depool/pool [09:24:55] <_joe_> elukey: yes, but wait [09:24:57] <_joe_> both of you [09:25:00] ahahhah [09:25:07] nono I wanted to ask, not taking actions [09:25:12] (03CR) 10Jcrespo: [C: 032] Update s4 partitioning in preparation for db1056 pooling [software] - 10https://gerrit.wikimedia.org/r/297560 (owner: 10Jcrespo) [09:26:22] <_joe_> ok no, scap pull doesn't solve the issue [09:26:26] <_joe_> that's pretty strange btw [09:26:32] <_joe_> let me check another thing [09:26:47] (03PS2) 10Jcrespo: Disable crons using the phabricator db slave due to maintenance [puppet] - 10https://gerrit.wikimedia.org/r/296877 (https://phabricator.wikimedia.org/T138460) [09:27:44] <_joe_> sigh, restarting hhvm solves the issue [09:27:57] <_joe_> so it seems some kind of tc corruption or something? [09:28:53] <_joe_> elukey: so I am going to debug this on mw1236 [09:28:58] <_joe_> if you want to take a look [09:29:07] <_joe_> I'll open a screen as root [09:29:07] elukey: filled the PHP notice spam at https://phabricator.wikimedia.org/T139447 [09:29:18] _joe_ sure thanks [09:30:31] hashar: super thanks! I the meantime, I was reading mw1236 [09:30:37] sorry https://wikitech.wikimedia.org/wiki/Deployments/One_week [09:30:39] :) [09:30:50] and apparently the notice is only on mw1215 mw1256 and mw1236 [09:30:55] the notices are only ... [09:31:07] !log installing tomcat security updates on Ubuntu systems (jessie already fixed) [09:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:31:42] so maybe those mw app server have been pooled with an incomplete config/ code base? [09:32:47] hashar: joe is working on them, it seems that restarting hhvm solves the issue.. [09:32:50] :/ [09:32:50] It started on mw1215 at 04:04:20 UTC. 
[09:34:01] <_joe_> sigh [09:34:14] <_joe_> fb changed the format of the sqlite3 files [09:34:17] <_joe_> apparently [09:34:26] <_joe_> they can't be opened from the command line [09:35:23] <_joe_> err, no, PEBKAC :P [09:35:57] _joe_: wrong executable name? :) [09:38:08] <_joe_> ema: sqlite vs sqlite3 [09:38:44] <_joe_> anyways, I just realized that the problem is probably in the tc cache somehow, and it's not new then [09:38:51] <_joe_> probably some issue with stat_cache [09:39:07] <_joe_> I am starting to think we should really start deploying code differently [09:39:54] <_joe_> !log restarting hhvm on mw1236,mw1215 to test for possible TC cache corruption [09:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:40:08] <_joe_> the strange part this time is [09:40:18] <_joe_> the issue appeared in the middle of nothing [09:40:32] looking at Translate the code is in a while(true) {} [09:40:42] so maybe that is a single process being stuck [09:40:47] <_joe_> nope [09:41:21] <_joe_> or well, might be, it wouldn't show in quickstack ofc [09:41:48] <_joe_> so your suggestion might be correct [09:42:08] <_joe_> but OTOH, why didn't hhvm kill the request in that case? we do have a request timeout [09:42:23] <_joe_> hashar: if you're right, it will happen again sooner rather than later [09:44:12] _joe_: the related code is mostly from 2009-2010, last touched in 2012 [09:44:14] we will see [09:44:19] (03CR) 10Jcrespo: [C: 032] Disable crons using the phabricator db slave due to maintenance [puppet] - 10https://gerrit.wikimedia.org/r/296877 (https://phabricator.wikimedia.org/T138460) (owner: 10Jcrespo) [09:44:54] (03CR) 10Filippo Giunchedi: "thoughts?" 
[puppet] - 10https://gerrit.wikimedia.org/r/295513 (owner: 10Filippo Giunchedi) [09:46:31] the reason I suspect a code loop is because million of messages on wmf.9 doesn't make much sense since it is not receiving much traffic (only on group0) [09:47:23] then maybe the loop is caused because some hhvm cache got corrupted somehow :/ [09:50:38] (03PS2) 10Filippo Giunchedi: swift: redirect syslog from all daemons to separate file [puppet] - 10https://gerrit.wikimedia.org/r/294678 (https://phabricator.wikimedia.org/T137397) [09:52:16] (03CR) 10Filippo Giunchedi: [C: 04-1] "holding off on this change until graphite gets lvs, not to introduce bogus/unused records" [dns] - 10https://gerrit.wikimedia.org/r/289635 (https://phabricator.wikimedia.org/T85451) (owner: 10Filippo Giunchedi) [09:53:29] (03CR) 10ArielGlenn: [C: 031] "woops sorry about that. yes, 30gb should be fine now that /srv is on its own partition." [puppet] - 10https://gerrit.wikimedia.org/r/295513 (owner: 10Filippo Giunchedi) [09:56:10] (03CR) 10Jcrespo: "Question, last time I checked, I saw that the actual issue may be the log level (it seemed too high INFO?). Have you considered reducing i" [puppet] - 10https://gerrit.wikimedia.org/r/294678 (https://phabricator.wikimedia.org/T137397) (owner: 10Filippo Giunchedi) [09:59:44] _joe_: the Translate extension code looks suspect. It loops until preg_match_all() yields no matches ( return 0) but will keep looping if there is a match error (preg_match_all() returning false). [09:59:50] no clue what is the root cause though [10:04:49] apergos: thanks! 
[10:04:54] (03PS2) 10Filippo Giunchedi: install_server: smaller root for single-disk /srv [puppet] - 10https://gerrit.wikimedia.org/r/295513 [10:05:01] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install_server: smaller root for single-disk /srv [puppet] - 10https://gerrit.wikimedia.org/r/295513 (owner: 10Filippo Giunchedi) [10:05:12] yw, I saw it initially and then forgot to comment, so thanks for the ping [10:07:45] be back post lunch [10:09:03] (03CR) 10Filippo Giunchedi: "The default log level is indeed a bit spammy because requests are logged for both frontend and backend services. OTOH it has been useful t" [puppet] - 10https://gerrit.wikimedia.org/r/294678 (https://phabricator.wikimedia.org/T137397) (owner: 10Filippo Giunchedi) [10:10:02] 06Operations, 10Traffic, 13Patch-For-Review: Investigate TCP Fast Open for tlsproxy - https://phabricator.wikimedia.org/T108827#2432677 (10ema) >>! In T108827#2430297, @faidon wrote: > Are the TFO cookies shared across different ports and if so, is the use case you're thinking of the speed up of the initial... [10:14:14] !log upgrading restbase cluster in codfw for nodejs 4.4.6 [10:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:18:36] jynus: FYI I'm going ahead with https://gerrit.wikimedia.org/r/#/c/294678/ if I've answered your question? [10:18:56] yes [10:22:43] kk thanks! 
[10:23:22] (03PS3) 10Filippo Giunchedi: swift: redirect syslog from all daemons to separate file [puppet] - 10https://gerrit.wikimedia.org/r/294678 (https://phabricator.wikimedia.org/T137397) [10:23:30] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: redirect syslog from all daemons to separate file [puppet] - 10https://gerrit.wikimedia.org/r/294678 (https://phabricator.wikimedia.org/T137397) (owner: 10Filippo Giunchedi) [10:29:23] 06Operations, 06Services, 13Patch-For-Review, 15User-mobrovac: Updates various services to nodejs 4.4.6 - https://phabricator.wikimedia.org/T138561#2432685 (10MoritzMuehlenhoff) To anyone maintaining a service running on the scb* clusters; the scb servers in codfw have already been upgraded to nodejs 4.4.6... [10:33:02] (03PS1) 10Filippo Giunchedi: swift: more inclusive rsyslog matching [puppet] - 10https://gerrit.wikimedia.org/r/297569 (https://phabricator.wikimedia.org/T137397) [10:34:51] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: more inclusive rsyslog matching [puppet] - 10https://gerrit.wikimedia.org/r/297569 (https://phabricator.wikimedia.org/T137397) (owner: 10Filippo Giunchedi) [10:41:25] (03CR) 10Elukey: "Tested on cdh101.analytics.eqiad.wmflabs" [puppet] - 10https://gerrit.wikimedia.org/r/296899 (https://phabricator.wikimedia.org/T139071) (owner: 10Elukey) [10:45:51] (03PS4) 10Gehel: Maps - notify tilerator of new expire files [puppet] - 10https://gerrit.wikimedia.org/r/297471 [10:48:30] PROBLEM - Disk space on ms-be3004 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdl1 70445 MB (3% inode=97%): /srv/swift-storage/sde1 88447 MB (4% inode=98%): /srv/swift-storage/sdj1 86503 MB (4% inode=98%): /srv/swift-storage/sdi1 77739 MB (4% inode=97%): /srv/swift-storage/sdk1 86101 MB (4% inode=98%): /srv/swift-storage/sdd1 85100 MB (4% inode=98%): /srv/swift-storage/sdf1 92634 MB (4% inode=98%): /srv/swift-storage/s [10:48:46] (03CR) 10Gehel: [C: 032] Maps - notify tilerator of new expire files [puppet] - 
10https://gerrit.wikimedia.org/r/297471 (owner: 10Gehel) [10:54:47] (03PS1) 10Filippo Giunchedi: swift: adjust group ownership/perms for /var/log/swift [puppet] - 10https://gerrit.wikimedia.org/r/297570 (https://phabricator.wikimedia.org/T137397) [10:57:39] (03PS2) 10Filippo Giunchedi: swift: adjust group ownership/perms for /var/log/swift [puppet] - 10https://gerrit.wikimedia.org/r/297570 (https://phabricator.wikimedia.org/T137397) [10:57:54] ACKNOWLEDGEMENT - Disk space on ms-be3004 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdl1 69816 MB (3% inode=97%): /srv/swift-storage/sde1 87914 MB (4% inode=98%): /srv/swift-storage/sdj1 85264 MB (4% inode=98%): /srv/swift-storage/sdi1 77739 MB (4% inode=97%): /srv/swift-storage/sdk1 83685 MB (4% inode=98%): /srv/swift-storage/sdd1 84497 MB (4% inode=98%): /srv/swift-storage/sdf1 92634 MB (4% inode=98%): /srv/swift-s [10:58:08] (03PS1) 10Jcrespo: Change dump-otrs.sh script permissions to 755 [puppet] - 10https://gerrit.wikimedia.org/r/297571 [10:59:14] (03CR) 10jenkins-bot: [V: 04-1] Change dump-otrs.sh script permissions to 755 [puppet] - 10https://gerrit.wikimedia.org/r/297571 (owner: 10Jcrespo) [10:59:48] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: adjust group ownership/perms for /var/log/swift [puppet] - 10https://gerrit.wikimedia.org/r/297570 (https://phabricator.wikimedia.org/T137397) (owner: 10Filippo Giunchedi) [11:04:52] (03PS2) 10Giuseppe Lavagetto: Debianization [software/service-checker] - 10https://gerrit.wikimedia.org/r/297558 [11:09:02] (03PS1) 10ArielGlenn: allow list of jobs to run to be passed as argument for dump scripts [dumps] - 10https://gerrit.wikimedia.org/r/297572 [11:10:25] 06Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 13Patch-For-Review: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2432714 (10Cyberpower678) I understand there is a patch for review. 
Will this patch fix the renames getting stuck problem, or simply get the c... [11:10:49] PROBLEM - puppet last run on ms-be2025 is CRITICAL: CRITICAL: Puppet has 1 failures [11:11:03] that's me, ^ known [11:12:44] (03PS1) 10Filippo Giunchedi: swift: adjust group ownership for Ubuntu/Debian [puppet] - 10https://gerrit.wikimedia.org/r/297573 (https://phabricator.wikimedia.org/T137397) [11:13:20] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Puppet has 1 failures [11:14:31] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: adjust group ownership for Ubuntu/Debian [puppet] - 10https://gerrit.wikimedia.org/r/297573 (https://phabricator.wikimedia.org/T137397) (owner: 10Filippo Giunchedi) [11:15:00] !log shutting down db1048 in preparation for upgrade [11:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:16:17] (03PS1) 10Yuvipanda: tools: Add a check for k8s backed webservices [puppet] - 10https://gerrit.wikimedia.org/r/297575 (https://phabricator.wikimedia.org/T131929) [11:16:40] (03PS2) 10Yuvipanda: tools: Add a check for k8s backed webservices [puppet] - 10https://gerrit.wikimedia.org/r/297575 (https://phabricator.wikimedia.org/T131929) [11:17:13] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Add a check for k8s backed webservices [puppet] - 10https://gerrit.wikimedia.org/r/297575 (https://phabricator.wikimedia.org/T131929) (owner: 10Yuvipanda) [11:17:41] RECOVERY - puppet last run on ms-be2025 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [11:19:50] (03PS3) 10Muehlenhoff: role::kafka::main::broker: Use DOMAIN_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/297409 [11:20:19] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:21:00] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [11:25:05] 06Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 
13Patch-For-Review: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2432757 (10Steinsplitter) >>! In T137973#2432714, @Cyberpower678 wrote: > I understand there is a patch for review. Will this patch fix the re... [11:26:15] 06Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 13Patch-For-Review: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2432759 (10Cyberpower678) I would very much love to. I dedicate a few hours to review the entire queue. [11:28:13] dbproxy is normal and expected [11:28:24] while m3-slave is under maintenance [11:28:48] (03CR) 10Muehlenhoff: [C: 032 V: 032] role::kafka::main::broker: Use DOMAIN_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/297409 (owner: 10Muehlenhoff) [11:29:24] ACKNOWLEDGEMENT - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Jcrespo m3-slave is down for upgrade [11:34:30] PROBLEM - puppet last run on ms-be2015 is CRITICAL: CRITICAL: Puppet has 1 failures [11:36:08] should recover by itself [11:36:37] (03CR) 10Filippo Giunchedi: [C: 04-1] "on hold for now" [puppet] - 10https://gerrit.wikimedia.org/r/289636 (https://phabricator.wikimedia.org/T85451) (owner: 10Filippo Giunchedi) [11:36:44] (03CR) 10Filippo Giunchedi: [C: 04-1] "on hold for now" [puppet] - 10https://gerrit.wikimedia.org/r/289637 (https://phabricator.wikimedia.org/T85451) (owner: 10Filippo Giunchedi) [11:36:50] RECOVERY - puppet last run on ms-be2015 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [11:38:18] (03CR) 10Filippo Giunchedi: [C: 031] role::labs::graphite: Further use of LABS_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/297361 (owner: 10Muehlenhoff) [11:42:50] (03PS1) 10ArielGlenn: move dependent recombine jobs into same invocation of dump script [puppet] - 10https://gerrit.wikimedia.org/r/297578 (https://phabricator.wikimedia.org/T139449) [11:43:27] (03CR) 10Filippo Giunchedi: "LGTM, why the 
$default_ prefix though? would $data_directory_base work ?" [puppet] - 10https://gerrit.wikimedia.org/r/297422 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [11:49:52] (03PS2) 10Muehlenhoff: role::labs::graphite: Further use of LABS_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/297361 [11:50:09] (03CR) 10Muehlenhoff: [C: 032 V: 032] role::labs::graphite: Further use of LABS_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/297361 (owner: 10Muehlenhoff) [11:52:08] (03CR) 10Gehel: "$default_ prefix because it is only used for default instance, not for multi instance. But that prefix might be overkill..." [puppet] - 10https://gerrit.wikimedia.org/r/297422 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [11:52:49] godog: ^ that `default` prefix seemed to make sense to me, but I can remove it. [11:53:21] PROBLEM - cassandra-c service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [11:54:19] !log depooling mw1261.eqiad.wmnet to raise Apache's mod-fcgi to trace8 for 503 investigation - T73487 (this will probably slow down a bit the host) [11:54:20] T73487: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487 [11:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:54:57] gehel: oh ok, yeah I saw your comment on the commit message about other parameters following the same pattern and was wondering, I'd remove it since no other parameter has it [11:55:16] godog: make sense... keep consistency... I'll do that [11:55:21] godog: thanks ! 
[11:55:40] PROBLEM - Puppet catalogue fetch on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/labs-puppetmaster/eqiad - 304 bytes in 0.068 second response time [11:56:02] ACKNOWLEDGEMENT - cassandra-c service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed Filippo Giunchedi bootstrap, T139362 [11:56:53] hmm [11:56:55] gehel: precisely, as you point out the module could use some refactoring too heh [11:58:01] godog: That might be something Nicko would enjoy! Or that I might do in a second step. Good way to try a few ideas and see what is in sync with our way of doing things... [11:58:41] ACKNOWLEDGEMENT - Puppet catalogue fetch on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/labs-puppetmaster/eqiad - 304 bytes in 0.089 second response time Yuvi Panda Switched puppetmaster and now the labs puppetmaster no longer loves me. - The acknowledgement expires at: 2016-07-08 11:57:59. [11:58:56] gehel: nice, thanks! [11:58:58] * godog lunch [11:59:07] godog: bon appétit! [12:00:57] mw1261 back in service with fcgi set to trace8, all good from the logs. It is not going to spam logstash (Apache ErrorLog not going to syslog), hopefully I'll get some good info today. 
Let me know if you see any trouble [12:01:12] sadly this is the only way that I can reproduce a bug -.- [12:03:39] 06Operations, 06Services, 10cassandra, 13Patch-For-Review: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362#2432839 (10mobrovac) p:05Triage>03High [12:05:10] (03PS1) 10Muehlenhoff: dumps: Restrict to PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/297582 [12:08:45] (03PS1) 10Muehlenhoff: ocg: Use PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/297583 [12:09:43] !log restbase deploy start of fa4699a [12:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:10:57] (03PS2) 10Gehel: Ensure base cassandra directory is created. [puppet] - 10https://gerrit.wikimedia.org/r/297422 (https://phabricator.wikimedia.org/T138092) [12:11:32] (03CR) 10Gehel: "removed `default` prefix as suggested by Filippo" [puppet] - 10https://gerrit.wikimedia.org/r/297422 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [12:13:09] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Hardware request for codfw WDQS server - https://phabricator.wikimedia.org/T138637#2432856 (10Gehel) [12:13:48] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: Deploy WDQS nodes on codfw - https://phabricator.wikimedia.org/T124862#2432857 (10Gehel) [12:14:21] Is Wikipedia down somewhere? [12:14:25] https://twitter.com/ElliotTurn/status/750664032194101248 [12:14:40] moritzm ^ [12:15:14] that seems like a local internet issue, Josve05a [12:15:51] ah ok [12:16:40] (03PS3) 10Gehel: Ensure base cassandra directory is created. [puppet] - 10https://gerrit.wikimedia.org/r/297422 (https://phabricator.wikimedia.org/T138092) [12:16:58] merci! 
[12:18:25] Josve05a: yeah, seems like local network issues, site is fine [12:18:53] RECOVERY - cassandra-c service on restbase1009 is OK: OK - cassandra-c is active [12:19:41] !log restbase deploy end of fa4699a [12:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:20:55] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:23:15] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [12:23:40] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/297242 (https://phabricator.wikimedia.org/T123918) (owner: 10Filippo Giunchedi) [12:28:58] 06Operations, 06Discovery, 06Maps, 10Maps-data: Maps - enable Geoshapes on production - https://phabricator.wikimedia.org/T138525#2432902 (10MoritzMuehlenhoff) p:05Triage>03Normal [12:29:35] 06Operations, 10ops-eqiad: decom antimony (datacenter) - https://phabricator.wikimedia.org/T138978#2432904 (10MoritzMuehlenhoff) p:05Triage>03Normal [12:29:41] (03PS2) 10Andrew Bogott: Provide labtest realm with it's own copy of network::subnets [puppet] - 10https://gerrit.wikimedia.org/r/297463 (owner: 10Alexandros Kosiaris) [12:31:18] (03CR) 10Andrew Bogott: [C: 032] "Thanks Alex!" [puppet] - 10https://gerrit.wikimedia.org/r/297463 (owner: 10Alexandros Kosiaris) [12:32:36] moritzm: Apparently two people on twitter are having the same issue...odd... [12:33:42] 07Puppet, 06Labs: Puppet failing on labtest* due to slice_network_constants() - https://phabricator.wikimedia.org/T139387#2432927 (10Andrew) 05Open>03Resolved Seems better with that patch. Thanks! [12:33:44] 06Operations, 10ops-eqiad: decom antimony (datacenter) - https://phabricator.wikimedia.org/T138978#2432929 (10Peachey88) [12:34:18] Seems to ban issue with Optonline/Cablevision users [12:34:24] bee an* [12:34:26] ba* [12:34:43] (ok..my keyboard autocorrects to hossible corrections...) 
[12:34:43] 06Operations: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#2432931 (10Andrew) Correct, the parsing folks are using Promethium. So we would need them to release it, or to rack an additional misc server here. [12:34:46] (03CR) 10Waldir: [C: 031] "LGTM, but for some reason the diff isn't properly colored on the Gerrit interface (the added text is green, but has no background). Is thi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297556 (https://phabricator.wikimedia.org/T127435) (owner: 10Thiemo Mättig (WMDE)) [12:39:58] 06Operations, 06Services, 10cassandra, 13Patch-For-Review: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362#2432948 (10mobrovac) Alternatively, would decommissioning rb1014-c help? [12:51:12] (03PS2) 10Jcrespo: Change dump-otrs.sh script permissions to 755 [puppet] - 10https://gerrit.wikimedia.org/r/297571 [12:51:33] (03PS3) 10Jcrespo: Change dump-otrs.sh script permissions to 755 [puppet] - 10https://gerrit.wikimedia.org/r/297571 [12:54:19] (03CR) 10Jcrespo: [C: 032] Change dump-otrs.sh script permissions to 755 [puppet] - 10https://gerrit.wikimedia.org/r/297571 (owner: 10Jcrespo) [13:01:26] (03PS1) 10Jcrespo: Change otrs dumps to use the fqdn, as it fails from codfw [puppet] - 10https://gerrit.wikimedia.org/r/297586 [13:01:38] (03PS2) 10Jcrespo: Change otrs dumps to use the fqdn, as it fails from codfw [puppet] - 10https://gerrit.wikimedia.org/r/297586 [13:02:39] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] foma: Initial Debian packaging [debs/contenttranslation/foma] - 10https://gerrit.wikimedia.org/r/295183 (https://phabricator.wikimedia.org/T120087) (owner: 10KartikMistry) [13:03:40] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] hfst-ospell: Initial Debian packaging [debs/contenttranslation/hfst-ospell] - 10https://gerrit.wikimedia.org/r/296231 (https://phabricator.wikimedia.org/T107306) (owner: 
10KartikMistry) [13:04:42] (03CR) 10Thiemo Mättig (WMDE): "The file is extremely big and may cause Gerrit's client side diff to fail. Check it out via git if in doubt. Touching unrelated lines is a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297556 (https://phabricator.wikimedia.org/T127435) (owner: 10Thiemo Mättig (WMDE)) [13:08:14] (03CR) 10Jcrespo: [C: 032] Change otrs dumps to use the fqdn, as it fails from codfw [puppet] - 10https://gerrit.wikimedia.org/r/297586 (owner: 10Jcrespo) [13:08:39] PROBLEM - Disk space on elastic1040 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=94%) [13:08:48] PROBLEM - Disk space on elastic1034 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=94%) [13:09:23] damn it's happening again... ^ [13:09:39] 06Operations, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2433073 (10elukey) Performed the same test on mw1261, same result. I compared o... 
[13:15:49] !log truncating elastic main logs on elastic1040 and elastic1034 [13:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:40] !log restarting elastic master node (elastic1040) [13:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:18:09] RECOVERY - Disk space on elastic1040 is OK: DISK OK [13:18:19] RECOVERY - Disk space on elastic1034 is OK: DISK OK [13:21:17] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [13:25:10] (03PS1) 10Jcrespo: Preparing db1048 for jessie install [puppet] - 10https://gerrit.wikimedia.org/r/297588 (https://phabricator.wikimedia.org/T138460) [13:26:31] (03CR) 10jenkins-bot: [V: 04-1] Preparing db1048 for jessie install [puppet] - 10https://gerrit.wikimedia.org/r/297588 (https://phabricator.wikimedia.org/T138460) (owner: 10Jcrespo) [13:27:39] (03PS2) 10Jcrespo: Preparing db1048 for jessie install [puppet] - 10https://gerrit.wikimedia.org/r/297588 (https://phabricator.wikimedia.org/T138460) [13:28:07] godog, elukey: anything else you want to check on https://gerrit.wikimedia.org/r/#/c/297422/ before I merge? [13:30:21] (03CR) 10Elukey: [C: 031] "Puppet compiler looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/297422 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [13:30:38] elukey: thanks! [13:31:09] gehel: I didn't see the message, I was reviewing the CR, good timing :) [13:31:25] elukey: you do have a 6th sense... [13:31:57] I'd prefer to disable puppet on aqs/restbase just in case before merging, but I may be too paranoid [13:35:47] elukey: paranoid is good. Want me to do it? 
[13:35:47] (03CR) 10Jcrespo: [C: 032] Preparing db1048 for jessie install [puppet] - 10https://gerrit.wikimedia.org/r/297588 (https://phabricator.wikimedia.org/T138460) (owner: 10Jcrespo) [13:35:53] !log depooling mw1261.eqiad to restore previous fcgi logging settings (T73487) [13:35:54] T73487: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487 [13:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:36:02] gehel: sure [13:36:15] (03CR) 10ArielGlenn: [C: 031] dumps: Restrict to PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/297582 (owner: 10Muehlenhoff) [13:38:52] !log disabling puppet on ^(aqs|restbase).* before merging changes to Cassandra puppet module [13:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:40:27] (03PS4) 10Gehel: Ensure base cassandra directory is created. [puppet] - 10https://gerrit.wikimedia.org/r/297422 (https://phabricator.wikimedia.org/T138092) [13:42:02] (03CR) 10Gehel: [C: 032] Ensure base cassandra directory is created. [puppet] - 10https://gerrit.wikimedia.org/r/297422 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [13:46:18] elukey: I confirmed that the change is a noop on aqs1001 and restbase 1014. Should I check another node? 
[13:47:28] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:47:47] gehel: good, I am happy with that :) [13:48:21] !log re-enabling puppet on ^(aqs|restbase).* after confirming that Cassandra puppet module change is a noop [13:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:48:48] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: puppet fail [13:53:17] (03PS1) 10Yuvipanda: tools: Provision star.tools.wmflabs.org cert for k8s master [puppet] - 10https://gerrit.wikimedia.org/r/297591 (https://phabricator.wikimedia.org/T139461) [14:00:04] moritzm: Respected human, time to deploy wikitech maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160706T1400). Please do the needful. [14:02:08] !log rebooting silver (hosting wikitech) [14:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:02:15] (03CR) 10Ottomata: "Let's do it" [puppet] - 10https://gerrit.wikimedia.org/r/296899 (https://phabricator.wikimedia.org/T139071) (owner: 10Elukey) [14:02:20] ouch, that's gonna hurt [14:02:33] let's see what does not come up after silver reboot [14:02:39] (03PS1) 10Jcrespo: Remove /a directory from db1048 [puppet] - 10https://gerrit.wikimedia.org/r/297593 (https://phabricator.wikimedia.org/T138460) [14:05:10] kart_: around ? [14:05:19] giella-sme building fails [14:05:40] (03PS1) 10Yuvipanda: tools: Use provisioned cert instead of puppet cert [puppet] - 10https://gerrit.wikimedia.org/r/297595 (https://phabricator.wikimedia.org/T139461) [14:06:19] (03CR) 10Jcrespo: [C: 032] Remove /a directory from db1048 [puppet] - 10https://gerrit.wikimedia.org/r/297593 (https://phabricator.wikimedia.org/T138460) (owner: 10Jcrespo) [14:06:25] gehel: (retroactively) LGTM! (re: cassandra code review) [14:06:36] godog: thanks! [14:07:21] (03CR) 10Alexandros Kosiaris: [V: 04-1] "package building fails. 
Details at https://phabricator.wikimedia.org/P3344" [debs/contenttranslation/giella-sme] - 10https://gerrit.wikimedia.org/r/294430 (https://phabricator.wikimedia.org/T120087) (owner: 10KartikMistry) [14:09:21] (03PS2) 10Filippo Giunchedi: install_server: pre-provision swift uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/297242 (https://phabricator.wikimedia.org/T123918) [14:09:28] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install_server: pre-provision swift uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/297242 (https://phabricator.wikimedia.org/T123918) (owner: 10Filippo Giunchedi) [14:10:56] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit nfs-exports is failed [14:12:38] (03CR) 10Alexandros Kosiaris: "FWIW, I consider this ready for review. It should help solve most of the problems mentioned concerning ldaplist" [puppet] - 10https://gerrit.wikimedia.org/r/295475 (owner: 10Alexandros Kosiaris) [14:13:37] akosiaris: it's back up. silver is not that special, it has been rebooted for all previous trusty issues as well. the only gotcha is that mysql needs to be started manually (like the mysql daemons we use for the main databases) [14:14:13] ah yes. well.. 
I would have honestly forgotten about that [14:16:52] nfs-exports failed due to wikitech going down I bet [14:17:06] let me fix [14:17:19] is back [14:17:27] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:57] PROBLEM - puppet last run on wtp2020 is CRITICAL: CRITICAL: puppet fail [14:18:06] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is OK: OK - nfs-exports is active [14:19:30] !log rebooting californium for kernel update (hosting horizon.wikimedia.org) [14:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:21:59] 06Operations, 10Analytics: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2433186 (10MoritzMuehlenhoff) p:05Triage>03Normal [14:22:25] hashar_: , yt? [14:22:51] (03PS4) 10Ottomata: Include librdkafka-dev in contint::packages::python [puppet] - 10https://gerrit.wikimedia.org/r/293273 (https://phabricator.wikimedia.org/T133779) [14:23:04] <_joe_> gehel: ^^ jmxtrans issues :D [14:25:04] (03CR) 10Ottomata: [C: 032] Include librdkafka-dev in contint::packages::python [puppet] - 10https://gerrit.wikimedia.org/r/293273 (https://phabricator.wikimedia.org/T133779) (owner: 10Ottomata) [14:25:45] _joe_: you know someone who knows something about jmxtrans? [14:26:11] <_joe_> gehel: I kinda remember that, yes [14:26:29] _joe_: I'll have a look... [14:26:43] <_joe_> gehel: I guess you're already aware about the issue btw [14:27:14] _joe_: there are lots of issues in jmxtrans. 
But I remember elukey talking to me about that one [14:28:10] (03CR) 10Giuseppe Lavagetto: [C: 032] Add tox tests [software/service-checker] - 10https://gerrit.wikimedia.org/r/297557 (owner: 10Giuseppe Lavagetto) [14:29:26] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 [14:29:43] * elukey blames gehel for jmxtrans [14:29:56] :P [14:30:22] * gehel did not *create* jmxtrans, he just tried to make it behave [14:30:25] * _joe_ blames gehel for java [14:30:30] <_joe_> that's blaming ;) [14:31:17] 06Operations, 06Services, 10cassandra, 13Patch-For-Review: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362#2433212 (10Eevans) >>! In T139362#2432838, @mobrovac wrote: > At least the cluster is still functional. So now we can either revert 1007 back to 2.1.13... [14:32:54] 06Operations, 06Services, 10cassandra, 13Patch-For-Review: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362#2433215 (10fgiunchedi) I don't think decommissioning would help, data would move off to other instances in rack `d` including 1014 to make space even t... [14:34:50] 06Operations, 10Analytics: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2433216 (10Gehel) I'm doing a release of jmxtrans right now. This come with a few fixes to the stability of the graphite and statsd writers, including moving to a different res... 
[14:35:03] (03PS1) 10Eevans: Revert "Bootstrap Cassandra instance restbase1009-c" [puppet] - 10https://gerrit.wikimedia.org/r/297598 (https://phabricator.wikimedia.org/T139362) [14:39:37] (03CR) 10Filippo Giunchedi: [C: 031] "{{rubberstamp}}" [puppet] - 10https://gerrit.wikimedia.org/r/297598 (https://phabricator.wikimedia.org/T139362) (owner: 10Eevans) [14:39:57] (03PS1) 10Nschaaf: Change path to proxy node requests [puppet] - 10https://gerrit.wikimedia.org/r/297599 (https://phabricator.wikimedia.org/T134782) [14:40:16] urandom: I'll merge ^ [14:40:26] godog: thanks! [14:40:37] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "Bootstrap Cassandra instance restbase1009-c" [puppet] - 10https://gerrit.wikimedia.org/r/297598 (https://phabricator.wikimedia.org/T139362) (owner: 10Eevans) [14:42:37] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:42:48] (03PS2) 10Yuvipanda: tools: Use provisioned cert instead of puppet cert [puppet] - 10https://gerrit.wikimedia.org/r/297595 (https://phabricator.wikimedia.org/T139461) [14:42:50] (03PS2) 10Yuvipanda: tools: Provision star.tools.wmflabs.org cert for k8s master [puppet] - 10https://gerrit.wikimedia.org/r/297591 (https://phabricator.wikimedia.org/T139461) [14:42:52] (03PS32) 10Yuvipanda: tools: Provision accounts for all tools [puppet] - 10https://gerrit.wikimedia.org/r/296747 (https://phabricator.wikimedia.org/T133999) [14:42:54] (03PS1) 10Yuvipanda: tools: Don't specify CA explicitly for client config [puppet] - 10https://gerrit.wikimedia.org/r/297600 (https://phabricator.wikimedia.org/T139461) [14:43:58] RECOVERY - puppet last run on wtp2020 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [14:45:58] godog: any reason that wouldn't have 'taken', yet? [14:46:53] (03PS3) 10Elukey: Raise the Hadoop HDFS datanode heapsize to 2GB. 
[puppet] - 10https://gerrit.wikimedia.org/r/296899 (https://phabricator.wikimedia.org/T139071) [14:47:00] 06Operations, 06Services, 13Patch-For-Review, 15User-mobrovac: Updates various services to nodejs 4.4.6 - https://phabricator.wikimedia.org/T138561#2433243 (10KartikMistry) Any plan to upgrade beta cluster along with this upgrade? [14:47:17] PROBLEM - cassandra-c service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [14:48:20] urandom: mhh no it should have worked already, didn't do anything? [14:48:37] ACKNOWLEDGEMENT - cassandra-c service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed eevans Disabling. [14:48:41] godog: not that i can see, no [14:49:47] RECOVERY - Disk space on stat1002 is OK: DISK OK [14:51:00] (03CR) 10Elukey: [C: 032] Raise the Hadoop HDFS datanode heapsize to 2GB. [puppet] - 10https://gerrit.wikimedia.org/r/296899 (https://phabricator.wikimedia.org/T139071) (owner: 10Elukey) [14:51:06] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 13 failures [14:51:09] (03CR) 10Hashar: "It is being rebuild and puppet ships it in the image :-}" [puppet] - 10https://gerrit.wikimedia.org/r/293273 (https://phabricator.wikimedia.org/T133779) (owner: 10Ottomata) [14:51:29] godog: maybe the question i should be asking is, what should it do? should it deconfigure the ip interface? [14:51:38] godog: remove the systemd unit? 
[14:52:03] * urandom never tried removing one before [14:52:09] thanks hashar lemme know when i can try to recheck my other change [14:55:05] urandom: mhh I think you are right, just commenting the instance won't actively remove it from the system, but yeah systemd unit and the network/config/data dirs essentially [14:55:19] ottomata: the ci base image is ready and has the lib you requested [14:55:32] ottomata: gotta wait for the existing instance to disappear, since they use a previous image [14:55:45] godog: probably enough to manually remove the systemd unit, yes? [14:55:53] urandom: yup [14:56:29] (03PS1) 10Muehlenhoff: pybal_config: Use PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/297601 [14:56:40] godog: {{done}} [14:59:48] <_joe_> elukey: I'm depooling mw1261 so that I can install the new apache package [15:00:00] 06Operations: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#2433252 (10RobH) a:05mark>03RobH Since promethium has since been allocated, I'm taking this task back to find another system to propose for approval to use on this task/project. [15:00:04] anomie, ostriches, thcipriani, hashar, twentyafterfour, and Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160706T1500). [15:00:05] Urbanecm and stephanebisson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:10] Present [15:00:15] _joe_ ack! 
[15:00:17] <_joe_> !log depooling mw1261, installing an apache package with additional fixes (T73487) [15:00:18] T73487: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487 [15:00:22] present [15:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:44] 06Operations, 06Discovery, 06Maps, 10Maps-data, 10hardware-requests: 2 servers for maps-beta cluster - https://phabricator.wikimedia.org/T138600#2405018 (10RobH) This doesn't state where the maps server being requested should reside, codfw or eqiad. Please advise. [15:01:49] (03CR) 10Alexandros Kosiaris: [C: 031] pybal_config: Use PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/297601 (owner: 10Muehlenhoff) [15:02:17] ottomata: was your patch https://gerrit.wikimedia.org/r/#/c/292755/ ? [15:02:26] !log poweroff prometheus2002 from dbrb -> plain conversion [15:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:02:37] hashar: yup! [15:02:51] ottomata: I did a recheck but pip fails due to an invalid requirement [15:02:52] I can SWAT today. Creating some tickets for log errors before doing so. [15:03:04] AssertionError: Sorry, 'git://github.com/confluentinc/confluent-kafka-python.git@master#egg=confluent-kafka' is a malformed VCS url. The format is +://, e.g. svn+http://myrepo/svn/MyApp#egg=MyApp [15:03:05] yeah i see it [15:03:10] 'git://github.com/confluentinc/confluent-kafka-python.git@master#egg=confluent-kafka' is a malformed VCS url. [15:03:11] hm [15:03:19] thcipriani: good morning. I got one filled for Translate related notices [15:03:23] i think its like that because i was testing a new untagged version [15:03:28] hashar: i'll work on it, thank you! [15:03:32] it should be able to build it now though, ja? [15:03:32] thcipriani: I see MaxSem merged a patch to wmf.9 yesterday but didn't deploy it. 
I think I can test it if you want to deploy it now. [15:03:35] if it can get it? [15:04:09] I couldn't deploy it because the checkout was in the middle of git am [15:04:33] oh good. [15:04:40] ottomata: on integration/config.git I have been using: git+https://gerrit.wikimedia.org/r/p/integration/zuul.git@8c250cfd5585e9e3a6ccbd14d9b77c9afe902036#egg=zuul [15:04:41] <_joe_> elukey: uhm seems my backport was a bit naive [15:04:53] RECOVERY - Puppet catalogue fetch on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.642 second response time [15:04:59] ottomata: looks like you want git:// ---> git+https:// [15:05:06] oh, it'll pull from github like that? [15:05:11] we'll have a .deb in prod [15:05:18] and i'm trying to get them to make a new tag [15:05:36] ottomata: and ideally release it to pypi :-} [15:05:39] yeah [15:05:40] they will [15:05:45] trying just +https [15:06:23] ottomata: I think the idea is you can do stuff like: svn+ssh:// git+ssh:// and maybe git+git:// ;-) [15:06:34] PROBLEM - Apache HTTP on mw1261 is CRITICAL: Connection refused [15:06:56] ottomata: solved! https://integration.wikimedia.org/ci/job/tox-jessie/9467/console [15:07:02] ---^ we are working on mw1261 [15:07:06] (03PS1) 10Eevans: Upgrade remaining rack 'a' nodes to 2.2.6 [puppet] - 10https://gerrit.wikimedia.org/r/297602 (https://phabricator.wikimedia.org/T126629) [15:07:24] (03CR) 10BryanDavis: [C: 031] tools: Provision accounts for all tools [puppet] - 10https://gerrit.wikimedia.org/r/296747 (https://phabricator.wikimedia.org/T133999) (owner: 10Yuvipanda) [15:07:33] ottomata: thank you to have taken care of adding the lib*-dev package to CI ;-} For flake8, you might want to pin it to an explicit version [15:07:53] great! 
, naw i'll just fix whatever it says [15:08:02] !log Disabling puppet on restbase101[0-1].eqiad.wmnet in preparation for 2.2.6 upgrade : T126629 [15:08:03] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [15:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:09:00] 06Operations, 06Discovery, 06Maps, 10Maps-data, 10hardware-requests: 2 servers for maps-beta cluster - https://phabricator.wikimedia.org/T138600#2433278 (10Gehel) The location is actually one of the questions that needs to be answered. Those should be labs servers on physical hardware, so this indicates e... [15:10:15] (03PS1) 10Ema: Package 4.1.3-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/297603 [15:10:17] (03CR) 10Eevans: "Puppet is currently disabled on the two affected nodes; This can be merged at any time." [puppet] - 10https://gerrit.wikimedia.org/r/297602 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [15:12:39] hashar: SUCCESS, thank you! [15:12:52] ottomata: it is magic ;-} [15:13:33] Are we SWATing today? [15:14:16] Urbanecm: yes [15:14:33] Ok hashar. Thx [15:15:11] anomie: MaxSem okie doke, am session fixed. Is "Show parser output for diffs unless extension aborts" the patch you both wanted deployed? [15:15:23] thcipriani: That's the one [15:15:29] yep [15:15:33] Urbanecm: yeah, there were some conflicts I had to resolve before I could SWAT [15:15:40] ok, I'll sync that patch out now. [15:15:49] Thanks. Are the conflicts fixed now? 
[15:16:14] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:17:01] Urbanecm: yup, just getting SWAT under way now [15:18:34] 06Operations: Goal: Modernize puppet configuration management infrastructure - https://phabricator.wikimedia.org/T139471#2433318 (10akosiaris) [15:18:42] !log thcipriani@tin Synchronized php-1.28.0-wmf.9/includes/diff/DifferenceEngine.php: [[gerrit:297547|Show parser output for diffs unless extension aborts (T139433)]] (duration: 00m 30s) [15:18:43] T139433: Page content is no longer showing with diffs - https://phabricator.wikimedia.org/T139433 [15:18:45] ^ anomie MaxSem check please [15:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:21] thcipriani, confirmed [15:19:29] MaxSem: thank you! [15:19:30] thcipriani: works [15:19:41] anomie: thanks [15:19:53] 06Operations: Goal: Modernize puppet configuration management infrastructure - https://phabricator.wikimedia.org/T139471#2433338 (10akosiaris) p:05Triage>03Normal [15:20:10] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297283 (https://phabricator.wikimedia.org/T139302) (owner: 10Urbanecm) [15:20:55] (03Merged) 10jenkins-bot: Enable autopatrolled user group at urwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297283 (https://phabricator.wikimedia.org/T139302) (owner: 10Urbanecm) [15:21:04] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:21:04] 06Operations, 13Patch-For-Review: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#2433341 (10akosiaris) [15:21:06] 06Operations: Goal: Modernize puppet configuration management infrastructure - https://phabricator.wikimedia.org/T139471#2433318 (10akosiaris) [15:22:48] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:297283|Enable autopatrolled user group at urwiki (T139302)]] (duration: 00m 30s) [15:22:49] T139302: Enable autopatrolled group at urwiki - https://phabricator.wikimedia.org/T139302 [15:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:22:53] ^ Urbanecm check please [15:23:33] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [15:23:36] Working. [15:23:40] Urbanecm: thank you! [15:23:47] You're welcome. [15:24:08] bearND: mobileapps flapped on scb1001. known ? [15:24:31] (03PS2) 10Thcipriani: Enable Echo transition flags everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297438 (owner: 10Mattflaschen) [15:24:59] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297438 (owner: 10Mattflaschen) [15:25:36] (03Merged) 10jenkins-bot: Enable Echo transition flags everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297438 (owner: 10Mattflaschen) [15:26:01] akosiaris: we had a lot of them last weekend due to memory issues but there was a fix by the ORES team to lower their memory requirements. 
I'm checking this one [15:26:22] the earlier one today was probably due to rb restart [15:27:08] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:297438|Enable Echo transition flags everywhere]] (duration: 00m 26s) [15:27:12] ^ stephanebisson check please [15:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:59] thcipriani: working [15:28:08] stephanebisson: thanks for checking [15:29:12] !log restarting the hdfs datanode on each analytics* Hadoop server to force the new -Xmx2048 heap setting to be picked up [15:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:29:18] anomie: MaxSem seeing this in the fatalmonitor, may be unrelated to your patch, but I didn't notice it earlier (it's not blowing up) Catchable fatal error: Argument 1 passed to DifferenceEngine::generateContentDiffBody() must implement interface Content, null given in /srv/mediawiki/php-1.28.0-wmf.9/includes/diff/DifferenceEngine.php on line 850 [15:29:27] akosiaris: heap limit exceeded [15:30:05] thcipriani, I saw that before, so not caused by it [15:30:19] bearND: some internal to nodejs I suppose. the box still has 6G of memory free [15:30:24] something* [15:30:27] thcipriani: i have seen that one earlier today [15:30:29] MaxSem: ack, just an FYI. 
[15:30:30] though likely related as in broken by the same commt I personally not sure should've been merged [15:31:13] thcipriani, I'll poke at it later [15:33:26] 06Operations: Install puppetDB at WMF - https://phabricator.wikimedia.org/T139476#2433419 (10akosiaris) [15:33:33] 06Operations: Goal: Modernize puppet configuration management infrastructure - https://phabricator.wikimedia.org/T139471#2433432 (10akosiaris) [15:33:35] 06Operations: Install puppetDB at WMF - https://phabricator.wikimedia.org/T139476#2433431 (10akosiaris) [15:35:53] PROBLEM - puppet last run on mw1261 is CRITICAL: CRITICAL: Puppet has 1 failures [15:36:33] (03CR) 10Filippo Giunchedi: [C: 031] Upgrade remaining rack 'a' nodes to 2.2.6 [puppet] - 10https://gerrit.wikimedia.org/r/297602 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [15:36:35] !log thcipriani@tin Synchronized php-1.28.0-wmf.9/extensions/Echo/modules/styles: SWAT: [[gerrit:297594|Set width to Special:Notifications (T138433)]] (duration: 00m 30s) [15:36:35] T138433: Notifications page: Notification bodies are not truncated - https://phabricator.wikimedia.org/T138433 [15:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:36:42] ^ stephanebisson check please [15:37:17] (03CR) 10Eevans: [C: 031] Upgrade remaining rack 'a' nodes to 2.2.6 [puppet] - 10https://gerrit.wikimedia.org/r/297602 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [15:38:34] thcipriani: strangely I don't see the fix on mw.org [15:38:54] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Upgrade remaining rack 'a' nodes to 2.2.6 [puppet] - 10https://gerrit.wikimedia.org/r/297602 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [15:40:13] stephanebisson: hmmm, what about when adding ?debug=true to the url? 
[15:40:43] just looking at: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#A_note_on_JavaScript_and_CSS [15:40:51] (03PS2) 10Ema: Package 4.1.3-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/297603 [15:41:55] thcipriani: yes, now I see it, thank you! Reloading without cache was apparently not enough. [15:42:05] thcipriani: all good [15:44:02] stephanebisson: awesome, thanks for checking :) [15:44:29] !log Upgrading Cassandra to 2.2.6-wmf1 on restbase1010 : T126629 [15:44:30] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [15:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:45:53] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Puppet has 1 failures [15:45:54] !log Re-enabling Puppet on restbase1010 : T126629 [15:45:55] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [15:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:47:57] !log Restarting Cassandra for restbase1010-a.eqiad.wmnet : T126629 [15:47:58] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [15:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:49:54] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/1: down - Core: cr2-eqiad:xe-5/2/3 (Zayo, OGYX/120003//ZYO, 36ms) {#11519} [10Gbps wave]BR [15:51:23] (03CR) 10Gehel: [C: 031] spec fix for aptrepo and installserver [puppet] - 10https://gerrit.wikimedia.org/r/297378 (https://phabricator.wikimedia.org/T78342) (owner: 10Hashar) [15:51:52] godog: 1010-a is {{done}} [15:52:12] (03PS2) 10Gehel: spec fix for aptrepo and installserver [puppet] - 10https://gerrit.wikimedia.org/r/297378 (https://phabricator.wikimedia.org/T78342) (owner: 10Hashar) [15:52:14] PROBLEM - puppet last run on 
labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [15:54:41] urandom: ack! [15:57:57] godog: fyi, if you apply upgrade to only one instance, metrics collection ceases for the others [15:58:09] godog: this should be obvious, i guess [15:58:44] godog: or tl;dr, the only abnormality i see is that 1010-{b,c} have no metrics :) [15:59:34] urandom: yeah that makes sense, only one version of cmcd [15:59:44] !log Restarting Cassandra for restbase1010-b.eqiad.wmnet : T126629 [15:59:44] gehel: sorry for the puppet spec patch "spec fix for aptrepo and installserver" it is a bit messy : [15:59:45] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [15:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:10] gehel: I could not find a good way to split it up in smaller chunks [16:00:29] hashar: not all that much. You are moving tests from one module to the other, it make sense... [16:00:45] (03CR) 10Gehel: [C: 032] spec fix for aptrepo and installserver [puppet] - 10https://gerrit.wikimedia.org/r/297378 (https://phabricator.wikimedia.org/T78342) (owner: 10Hashar) [16:02:48] 06Operations, 06Services, 13Patch-For-Review, 15User-mobrovac: Updates various services to nodejs 4.4.6 - https://phabricator.wikimedia.org/T138561#2433478 (10mobrovac) RESTBase in production is running on 4.4.6 as of earlier today. [16:04:19] 06Operations, 06Services, 13Patch-For-Review, 15User-mobrovac: Updates various services to nodejs 4.4.6 - https://phabricator.wikimedia.org/T138561#2433481 (10mobrovac) >>! In T138561#2433243, @KartikMistry wrote: > Any plan to upgrade beta cluster along with this upgrade? You can update your BC hosts by... 
[16:05:12] (03CR) 10Gehel: [C: 031] "Manually checked the content of labsdb1004 and labsdb1006 against what Puppet declares, it looks good" [puppet] - 10https://gerrit.wikimedia.org/r/296551 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [16:05:41] gehel: will probably be able to have some experimental jenkins job soonish :-} [16:05:57] hashar: Youhouhou! [16:07:03] PROBLEM - cassandra-b CQL 10.64.0.115:9042 on restbase1010 is CRITICAL: Connection refused [16:07:13] hashar: do you know how to mock functions in puppet tests? I was stuck on https://gerrit.wikimedia.org/r/#/c/297471/ ... [16:07:24] PROBLEM - cassandra-b service on restbase1010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [16:09:54] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:15:13] gehel: monitoring works for cassandra :P [16:15:24] elukey: cool! [16:15:57] ^^^^ I have the Cassandra alert [16:16:03] alert(s) [16:16:34] RECOVERY - cassandra-b CQL 10.64.0.115:9042 on restbase1010 is OK: TCP OK - 0.002 second response time on port 9042 [16:17:03] RECOVERY - cassandra-b service on restbase1010 is OK: OK - cassandra-b is active [16:18:37] elukey: you did not need to crash cassandra just to prove that alerting works :P [16:21:18] 06Operations, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2433549 (10Joe) In the end, we decided we need the following patches: - https:... 
[16:22:01] !log Restarting Cassandra for restbase1010-c.eqiad.wmnet : T126629 [16:22:02] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [16:22:05] gehel: that's all me [16:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:22:32] * urandom is the Crasher of Things [16:23:12] <_joe_> I am going off for now, ttyl [16:25:39] (03PS1) 10Yuvipanda: tools: Don't set CA explicitly for kube2proxy [puppet] - 10https://gerrit.wikimedia.org/r/297611 (https://phabricator.wikimedia.org/T139461) [16:25:57] !log Upgrade of restbase1010.eqiad.wmnet instances complete : T126629 [16:25:58] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [16:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:26:12] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Hardware request for codfw WDQS server - https://phabricator.wikimedia.org/T138637#2433568 (10Gehel) [16:26:53] PROBLEM - Hadoop Namenode - Primary on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [16:26:58] checking [16:27:04] ---^ [16:27:32] (03CR) 10JanZerebecki: "Looks good, waiting on whitelist ok." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296753 (https://phabricator.wikimedia.org/T138943) (owner: 10Addshore) [16:28:02] jzerebecki: it still needs branching and the submodule adding to the core branch now! 
[16:30:57] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Hardware request for codfw WDQS server - https://phabricator.wikimedia.org/T138637#2433590 (10Gehel) [16:31:06] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Hardware request for codfw WDQS server - https://phabricator.wikimedia.org/T138637#2433592 (10RobH) wdqs100[12] have the following hardware: * Dell PowerEdge R420 * Dual Intel(R) Xeon(R) CPU E5-2440 0 @ 2.40GHz (6 cores... [16:31:43] my fault, PEBKAC, I didn't wait too long before restarting journal nodes [16:32:35] I was restarting them to pick up the new Heap settings [16:32:46] we have automatic failover so all good [16:33:02] * elukey joins urandom as Crasher of Things [16:33:15] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: codfw: (2) wqds200[12] systems - https://phabricator.wikimedia.org/T138637#2433599 (10RobH) a:03RobH [16:34:27] RECOVERY - Hadoop Namenode - Primary on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [16:35:06] all good, sorry for the page [16:37:07] (03PS8) 10Addshore: Deploy RevisionSlider to test and test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296753 (https://phabricator.wikimedia.org/T138943) [16:39:02] !log Upgrading Cassandra package to 2.2.6-wmf1 on restbase1011.eqiad.wmnet : T126629 [16:39:03] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [16:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:08] cmjohnson1: hi! 
sorry to bother you, just wanted to have a chat about the broken disk on analytics1049 [16:39:49] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: codfw: (2) wqds200[12] systems - https://phabricator.wikimedia.org/T138637#2433624 (10RobH) [16:41:07] !log Restarting Cassandra fro restbase1011-a.eqiad.wmnet : T126629 [16:41:08] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [16:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:41:13] heh, fro [16:42:22] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: codfw: (2) wqds200[12] systems - https://phabricator.wikimedia.org/T138637#2433629 (10Smalyshev) I think we don't need //exact// copies of wdqs100*. Minimum reqs are 300G HD and 64G RAM, but whatever is more convenient... [16:45:22] !log Restarting Cassandra for restbase1011-b.eqiad.wmnet : T126629 [16:45:22] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [16:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:48:07] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:50:04] !log Restarting Cassandra for restbase1011-c.eqiad.wmnet : T126629 [16:50:04] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [16:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:53:59] !log Upgrade of restbase1011.eqiad.wmnet instances to Cassandra 2.2.6 complete : T126629 [16:54:00] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [16:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:42] (03PS1) 10Yuvipanda: tools: Don't include k8s::ssl when not necessary [puppet] - 10https://gerrit.wikimedia.org/r/297617 [16:57:28] PROBLEM - puppet last 
run on mw1181 is CRITICAL: CRITICAL: Puppet has 1 failures [17:08:16] 06Operations, 06Commons, 10media-storage, 07User-notice: Update rsvg on the image scalers to 2.40.16 (to solve several SVG rendering issues) - https://phabricator.wikimedia.org/T112421#2433741 (10Menner) Ping @MoritzMuehlenhoff : Noticed my last comment? [17:09:25] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:11:35] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:14:38] !log Disabling Puppet on restbase{1008,1012,1013}.eqiad.wmnet in preparation for rack 'b' Cassandra upgrade : T126629 [17:14:39] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [17:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:16:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:16:27] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:17:41] (03PS1) 10Eevans: Upgrade rack 'b' Cassandra nodes to 2.2.6 [puppet] - 10https://gerrit.wikimedia.org/r/297619 (https://phabricator.wikimedia.org/T126629) [17:18:03] (03PS2) 10Yuvipanda: tools: Don't include k8s::ssl when not necessary [puppet] - 10https://gerrit.wikimedia.org/r/297617 [17:18:05] (03PS2) 10Yuvipanda: tools: Don't specify CA explicitly for client config [puppet] - 10https://gerrit.wikimedia.org/r/297600 (https://phabricator.wikimedia.org/T139461) [17:18:05] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Puppet has 1 failures [17:18:07] (03PS3) 10Yuvipanda: tools: Use provisioned cert instead of puppet cert [puppet] - 10https://gerrit.wikimedia.org/r/297595 (https://phabricator.wikimedia.org/T139461) [17:18:09] (03PS3) 10Yuvipanda: tools: Provision star.tools.wmflabs.org cert for k8s master 
[puppet] - 10https://gerrit.wikimedia.org/r/297591 (https://phabricator.wikimedia.org/T139461) [17:18:11] (03PS2) 10Yuvipanda: tools: Don't set CA explicitly for kube2proxy [puppet] - 10https://gerrit.wikimedia.org/r/297611 (https://phabricator.wikimedia.org/T139461) [17:18:13] (03PS33) 10Yuvipanda: tools: Provision accounts for all tools [puppet] - 10https://gerrit.wikimedia.org/r/296747 (https://phabricator.wikimedia.org/T133999) [17:18:27] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Provision accounts for all tools [puppet] - 10https://gerrit.wikimedia.org/r/296747 (https://phabricator.wikimedia.org/T133999) (owner: 10Yuvipanda) [17:18:29] (03PS5) 10Dzahn: DHCP: remove promethium.eqad [puppet] - 10https://gerrit.wikimedia.org/r/297534 (https://phabricator.wikimedia.org/T120262) [17:18:31] (03CR) 10Eevans: [C: 031] "Puppet is disabled on the affected nodes; This can be merged at any time." [puppet] - 10https://gerrit.wikimedia.org/r/297619 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [17:18:41] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Provision star.tools.wmflabs.org cert for k8s master [puppet] - 10https://gerrit.wikimedia.org/r/297591 (https://phabricator.wikimedia.org/T139461) (owner: 10Yuvipanda) [17:18:54] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Don't specify CA explicitly for client config [puppet] - 10https://gerrit.wikimedia.org/r/297600 (https://phabricator.wikimedia.org/T139461) (owner: 10Yuvipanda) [17:19:09] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Use provisioned cert instead of puppet cert [puppet] - 10https://gerrit.wikimedia.org/r/297595 (https://phabricator.wikimedia.org/T139461) (owner: 10Yuvipanda) [17:19:23] (03CR) 10Yuvipanda: [C: 032] tools: Don't set CA explicitly for kube2proxy [puppet] - 10https://gerrit.wikimedia.org/r/297611 (https://phabricator.wikimedia.org/T139461) (owner: 10Yuvipanda) [17:19:31] (03CR) 10Yuvipanda: [V: 032] tools: Don't set CA explicitly for kube2proxy [puppet] - 
10https://gerrit.wikimedia.org/r/297611 (https://phabricator.wikimedia.org/T139461) (owner: 10Yuvipanda) [17:19:46] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Don't include k8s::ssl when not necessary [puppet] - 10https://gerrit.wikimedia.org/r/297617 (owner: 10Yuvipanda) [17:21:27] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [17:21:46] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:23:07] (03CR) 10Alex Monk: "What about in modules/install_server/files/autoinstall/netboot.cfg ?" [puppet] - 10https://gerrit.wikimedia.org/r/297534 (https://phabricator.wikimedia.org/T120262) (owner: 10Dzahn) [17:28:15] (03PS6) 10Dzahn: DHCP: remove promethium.eqad [puppet] - 10https://gerrit.wikimedia.org/r/297534 (https://phabricator.wikimedia.org/T120262) [17:30:27] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: puppet fail [17:33:09] 06Operations, 10Analytics: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2433874 (10Gehel) jmxtrans 259 is now released: http://central.maven.org/maven2/org/jmxtrans/jmxtrans/259/ [17:35:55] (03PS1) 10Elukey: Revert "Raise the Hadoop HDFS datanode heapsize to 2GB." [puppet] - 10https://gerrit.wikimedia.org/r/297622 [17:36:18] (03PS2) 10Elukey: Revert "Raise the Hadoop HDFS datanode heapsize to 2GB." [puppet] - 10https://gerrit.wikimedia.org/r/297622 [17:36:28] (03PS3) 10Elukey: Revert "Raise the Hadoop HDFS datanode heapsize to 2GB." [puppet] - 10https://gerrit.wikimedia.org/r/297622 [17:37:44] (03CR) 10Elukey: [C: 032] Revert "Raise the Hadoop HDFS datanode heapsize to 2GB." 
[puppet] - 10https://gerrit.wikimedia.org/r/297622 (owner: 10Elukey) [17:41:22] 06Operations, 06Commons, 10media-storage, 07User-notice: Update rsvg on the image scalers to 2.40.16 (to solve several SVG rendering issues) - https://phabricator.wikimedia.org/T112421#2433897 (10MoritzMuehlenhoff) @Menner: I'll have a look tomorrow. [17:41:56] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [17:42:40] (03CR) 10Muehlenhoff: "I'll review this tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/295475 (owner: 10Alexandros Kosiaris) [17:47:34] (03PS1) 10Yuvipanda: Add python2 base image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/297624 [17:47:46] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:47:53] (03CR) 10jenkins-bot: [V: 04-1] Add python2 base image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/297624 (owner: 10Yuvipanda) [17:48:20] (03PS2) 10Yuvipanda: Add python2 base image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/297624 [17:48:48] 06Operations, 10Analytics: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2433931 (10fgiunchedi) thanks @Gehel ! this would likely allow us to fix {T97277} too [17:49:33] 06Operations, 10Analytics: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2433934 (10Gehel) @fgiunchedi Yep, that fix has been merged. [17:50:00] godog, elukey: I should probably take some time to understand how we use jmxtrans here. [17:50:35] thanks! [17:50:39] do we collect metrics on jmxtrans itself? It would help a lot for tuning implementation a bit more [17:50:49] no idea :( [17:51:26] should not be too hard to add... 
[17:51:49] gehel: no idea either, but let me know if you have questions on the graphite side [17:54:58] like reported in the ticket, I think we could switch from statsd to graphite/carbon heh [17:59:06] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:12:53] James_F / greg-g could I get one of you to look at and approve https://gerrit.wikimedia.org/r/#/c/296753/ with a +1? [18:14:18] addshore: I assume it's on beta cluster? [18:14:26] yup, and has been for a few weeks [18:14:40] maybe even a month :) [18:15:42] good :) [18:16:24] addshore: I'll let James_F comment re Beta Feature readiness, but I'm OK otherwise [18:16:55] godog: could you do the honors with: https://gerrit.wikimedia.org/r/#/c/297619/ ? [18:16:59] awesome, afaik the beta feature readiness is the only thing pending, per the inline docs above the whitelist setting [18:25:56] (03PS7) 10Dzahn: DHCP: fix promethium.eqad entry [puppet] - 10https://gerrit.wikimedia.org/r/297534 (https://phabricator.wikimedia.org/T120262) [18:27:12] (03PS8) 10Dzahn: DHCP: fix promethium.eqad entry [puppet] - 10https://gerrit.wikimedia.org/r/297534 (https://phabricator.wikimedia.org/T120262) [18:28:07] (03PS9) 10Dzahn: DHCP: fix promethium.eqad entry [puppet] - 10https://gerrit.wikimedia.org/r/297534 (https://phabricator.wikimedia.org/T120262) [18:28:11] (03CR) 10Rush: "labstore200[3|4] are somewhat anomalous here. They were hosts allocated for something coren had going ages ago. I did an inventory of la" [puppet] - 10https://gerrit.wikimedia.org/r/295513 (owner: 10Filippo Giunchedi) [18:29:13] (03CR) 10Dzahn: [C: 032] "just reverting what was accidentally broken to how it was before. 
if it makes sense or not that this is in carbon DNS others can decide" [puppet] - 10https://gerrit.wikimedia.org/r/297534 (https://phabricator.wikimedia.org/T120262) (owner: 10Dzahn) [18:32:20] elukey: ping [18:32:27] urandom: o/ [18:32:46] elukey: could i maybe get you to merge https://gerrit.wikimedia.org/r/#/c/297619 ? [18:32:57] it's an on-going upgrade, pretty routine [18:33:30] elukey: if you're not comfortable with that, no worries [18:33:41] nah it seems good! [18:33:47] sweet [18:33:59] (03PS2) 10Elukey: Upgrade rack 'b' Cassandra nodes to 2.2.6 [puppet] - 10https://gerrit.wikimedia.org/r/297619 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [18:34:32] urandom: what is the plan? Just asking to have a clearer view [18:34:35] PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: Puppet has 78 failures [18:35:24] are you going to install the new cassandra pkg on these? I am thinking what happens when puppet runs with the change on the hosts [18:35:27] i have puppet disabled on those hosts, one by one i will a) upgrade the package, b) enable puppet, c) run puppet, then restart the instances one by one [18:36:04] elukey: basically, what is listed at https://phabricator.wikimedia.org/T126629 under "Upgrade process" [18:36:09] +1, thanks :) [18:36:21] (03PS3) 10Ema: Package 4.1.3-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/297603 [18:36:22] 06Operations, 06Performance-Team, 06Services, 07Availability: Consider restbase/cassandra for session storage (with SSL) - https://phabricator.wikimedia.org/T134811#2434198 (10Smalyshev) [18:36:25] 06Operations, 06Performance-Team, 06Services, 07Availability, and 3 others: Create restbase BagOStuff subclass (session storage) - https://phabricator.wikimedia.org/T137272#2434197 (10Smalyshev) 05Open>03Resolved [18:36:50] (03CR) 10Elukey: [C: 032] "Discussed the change with urandom on IRC, puppet is disabled on these hosts and will be re-enabled following the upgrade procedure." 
[puppet] - 10https://gerrit.wikimedia.org/r/297619 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [18:38:10] urandom: merged! [18:38:11] 06Operations: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#2434215 (10chasemp) hang on, I'm confused on why we are pursuing this. The last public statement was we are not going to get entrenched in the hack of hardware in the instance subnet... [18:38:26] elukey: \o/ [18:38:29] thanks sir! [18:39:28] halfak: i am ready to merge the latest change for ores monitoring. it might restart all the nginx'es though [18:39:31] Hello. I've just got: "[V31QPgpAMFoAAE6HeHwAAAAV] 2016-07-06 18:38:54: Fatal exception of type "Exception"" [18:39:43] halfak: doesnt worry me much though .. unless it does for you [18:39:49] hey mafk [18:40:02] mutante, doesn't worry me. [18:40:08] hi Krenair - It was accessing fwiw [18:40:23] I can already predict the exception we're about to see [18:40:31] I'll keep a grafana tab open just in case. [18:40:32] Thanks :) [18:40:38] !log Upgrading Cassandra package to 2.2.6-wmf1 on restbase1008 : T126629 [18:40:39] mafk: Yeah here we are: Could not find local user data for J�zefO16@commonswiki [18:40:39] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [18:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:40:50] authmanager? [18:41:06] (03PS2) 10Dzahn: Change path to proxy node requests [puppet] - 10https://gerrit.wikimedia.org/r/297599 (https://phabricator.wikimedia.org/T134782) (owner: 10Nschaaf) [18:41:12] Didn't this bug pre-date authmanager? [18:41:21] * halfak watches activity slowly fade away from the labs installation [18:41:35] I don't know. 
[18:41:44] clueless at this things [18:41:50] mafk, anyway here's the task: https://phabricator.wikimedia.org/T119736 [18:42:05] (03CR) 10Dzahn: [C: 032] Change path to proxy node requests [puppet] - 10https://gerrit.wikimedia.org/r/297599 (https://phabricator.wikimedia.org/T134782) (owner: 10Nschaaf) [18:42:15] ktnx :) [18:42:55] !log Restarting Cassandra instance restbase1008-a : T126629 [18:42:56] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [18:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:42:59] looks like Gergo is manually repairing those [18:43:36] halfak: maybe you could run puppet on the labs nodes? [18:43:47] mutante, will do [18:43:58] cool [18:44:04] Just on the lb, right? [18:44:12] yes. lb, where nginx runs [18:44:15] Oh wait. Also on the precacher [18:44:52] meanwhile i will check on neon, prod icinga [18:46:14] !log mw1261 restart hhvm service [18:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:46:35] mutante, https://ores.wmflabs.org/node/ores-web-03/ works! \o/ [18:46:55] halfak: :) great [18:47:22] so we just need to wait for the change on icinga [18:47:30] then we could do that test [18:47:44] and break something [18:47:46] !log Restarting Cassandra instance restbase1008-b : T126629 [18:47:47] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [18:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:50:52] 06Operations, 10ops-codfw, 06DC-Ops, 10Incident-20150617-LabsNFSOutage: Labstore2001 controller or shelf failure - https://phabricator.wikimedia.org/T102626#2434301 (10Dzahn) RAID failure for this host in Icinga for the last > 3 days https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=labs... [18:51:40] !log labstore2001 - RAID failure in Icinga (is it T102626 ?) 
[18:51:41] T102626: Labstore2001 controller or shelf failure - https://phabricator.wikimedia.org/T102626 [18:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:52:46] !log Restarting Cassandra instance restbase1008-c : T126629 [18:52:47] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [18:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:54:12] !log mw1261 syntax error in Apache config [18:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:54:32] !log labstore1001 - failed backup [18:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:55:34] !log Cassandra upgrade of restbase1008.eqiad.wmnet instances complete : T126629 [18:55:35] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [18:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:57:35] !lo mw1133 puppet fail because out of memory, stop hhvm, run puppet [18:57:57] !log mw1133 puppet fail because out of memory, stop hhvm, run puppet [18:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:58:21] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [19:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160706T1900). Please do the needful. [19:00:18] (03CR) 10MaxSem: [C: 031] "Tested, works." 
[puppet] - 10https://gerrit.wikimedia.org/r/296551 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [19:00:22] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [19:01:00] 06Operations, 06Labs: revise/fix labstore replicate backup jobs - https://phabricator.wikimedia.org/T127567#2434342 (10Dzahn) failed backups in Icinga again: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=labstore1001&service=Last+backup+of+the+maps+filesystem https://icinga.wikimedia.org... [19:07:12] !log Upgrading Cassandra package to 2.2.6-wmf1 on restbase1012.eqiad.wmnet : T126629 [19:07:13] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [19:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:09:00] !log Restarting Cassandra instance restbase1012-a.eqiad.wmnet : T126629 [19:09:01] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [19:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:16:17] 06Operations: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#2434396 (10RobH) a:05RobH>03chasemp Sounds fine by me. I'm going to set this to keep this stalled & assign to @chasemp for his comment following his meeting about this. (If it s... [19:16:39] (03PS1) 10Ottomata: Frack uses analytics-eqiad kafka cluster! 
Switching back to ALL_NETWORKS [puppet] - https://gerrit.wikimedia.org/r/297631
[19:18:12] !log Restarting Cassandra instance restbase1012-b.eqiad.wmnet : T126629
[19:18:13] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629
[19:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:18:28] (CR) Ottomata: [C: 2] "Reverts the broker ferm change from https://gerrit.wikimedia.org/r/#/c/297368" [puppet] - https://gerrit.wikimedia.org/r/297631 (owner: Ottomata)
[19:20:43] Operations: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#2434424 (Andrew) OK, sorry -- I may have overpromised here. Can we get a clear description of what this is for and what specifically is insufficient about labs VMs for the solution...
[19:23:09] Operations, Wikimedia-SVG-rendering, Upstream: Incorrect text positioning in SVG rasterization (any extreme down scale) (fixed in upstream 2.40.13) - https://phabricator.wikimedia.org/T65703#680361 (kaldari) Seems to still be an issue even though the image scalers have been updated to 2.40.16. Compa...
[19:24:24] !log Restarting Cassandra instance restbase1012-c.eqiad.wmnet : T126629
[19:24:25] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629
[19:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:28:29] (PS3) Yuvipanda: Add python2 base + web image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/297624
[19:28:47] (CR) jenkins-bot: [V: -1] Add python2 base + web image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/297624 (owner: Yuvipanda)
[19:28:49] (PS4) Yuvipanda: Add python2 base + web image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/297624
[19:30:42] !log Upgrade of restbase1012.eqiad.wmnet instances complete : T126629
[19:30:43] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629
[19:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:31:27] (PS5) Yuvipanda: Add python2 base + web image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/297624
[19:31:44] (CR) jenkins-bot: [V: -1] Add python2 base + web image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/297624 (owner: Yuvipanda)
[19:31:54] (PS6) Yuvipanda: Add python2 base + web image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/297624
[19:38:29] (PS7) Yuvipanda: Add python2 base + web image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/297624
[19:38:44] (PS8) Yuvipanda: Add python2 base + web image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/297624
[19:38:47] (CR) jenkins-bot: [V: -1] Add python2 base + web image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/297624 (owner: Yuvipanda)
[19:41:45] (PS4) Gehel: Postgresql - allow multiple entries for the same user in pg_hba.conf [puppet] - https://gerrit.wikimedia.org/r/296551
(https://phabricator.wikimedia.org/T138092)
[19:42:27] (PS9) Yuvipanda: Add python2 base + web image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/297624
[19:42:44] (CR) jenkins-bot: [V: -1] Add python2 base + web image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/297624 (owner: Yuvipanda)
[19:43:00] (PS10) Yuvipanda: Add python2 base + web image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/297624
[19:43:42] (PS2) Krinkle: Remove unused deprecated $wgStyleSheetPath [mediawiki-config] - https://gerrit.wikimedia.org/r/297511
[19:50:05] (PS1) 20after4: group1 wikis to 1.28.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/297634
[19:52:09] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures
[19:52:19] (CR) Gehel: [C: 2] Postgresql - allow multiple entries for the same user in pg_hba.conf [puppet] - https://gerrit.wikimedia.org/r/296551 (https://phabricator.wikimedia.org/T138092) (owner: Gehel)
[19:55:42] (CR) 20after4: [C: 2] group1 wikis to 1.28.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/297634 (owner: 20after4)
[19:56:22] (Merged) jenkins-bot: group1 wikis to 1.28.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/297634 (owner: 20after4)
[19:56:53] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.28.0-wmf.9
[19:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:57:56] !log Upgrading Cassandra package to 2.2.6-wmf1 on restbase1013.eqiad.wmnet : T126629
[19:57:57] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629
[19:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:58:25] !log deployed 1.28.0-wmf.9 to group1 wikis: T138555
[19:58:26] T138555: MW-1.28.0-wmf.9 deployment blockers -
https://phabricator.wikimedia.org/T138555
[19:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:59:50] ACKNOWLEDGEMENT - Apache HTTP on mw1261 is CRITICAL: Connection refused daniel_zahn T73487
[19:59:50] ACKNOWLEDGEMENT - puppet last run on mw1261 is CRITICAL: CRITICAL: puppet fail daniel_zahn T73487
[20:00:04] gwicke, cscott, arlolra, subbu, bearND, and mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160706T2000).
[20:00:19] !log Restarting Cassandra instance restbase1013-a.eqiad.wmnet : T126629
[20:00:20] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629
[20:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:01:07] no parsoid deploy today
[20:03:33] (CR) Hashar: beta: send MariaDB errors to syslog (1 comment) [puppet] - https://gerrit.wikimedia.org/r/296713 (https://phabricator.wikimedia.org/T119370) (owner: Hashar)
[20:03:35] (PS11) Yuvipanda: Add python2 base + web image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/297624
[20:04:58] (CR) jenkins-bot: [V: -1] Add python2 base + web image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/297624 (owner: Yuvipanda)
[20:05:09] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:05:19] (PS3) Hashar: beta: send MariaDB errors to syslog [puppet] - https://gerrit.wikimedia.org/r/296713 (https://phabricator.wikimedia.org/T119370)
[20:06:25] !log Restarting Cassandra instance restbase1013-b.eqiad.wmnet : T126629
[20:06:26] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629
[20:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:11:37] (PS1) Nuria: Seeting up log retention policy in yarn
[puppet/cdh] - https://gerrit.wikimedia.org/r/297643 (https://phabricator.wikimedia.org/T139178)
[20:12:51] (PS2) Nuria: Seeting up log retention policy in yarn [puppet/cdh] - https://gerrit.wikimedia.org/r/297643 (https://phabricator.wikimedia.org/T139178)
[20:13:39] Operations, Wikimedia-SVG-rendering, Upstream: Incorrect text positioning in SVG rasterization (any extreme down scale) (fixed in upstream 2.40.13) - https://phabricator.wikimedia.org/T65703#2434711 (Menner) Please be patient and see T112421 in the meantime.
[20:13:53] !log Upgrade of restbase1013.eqiad.wmnet instances complete : T126629
[20:13:54] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629
[20:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:15:29] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[20:16:09] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master).
[20:23:15] (PS1) Eevans: Upgrade restbase1009 to Cassandra 2.2.6 [puppet] - https://gerrit.wikimedia.org/r/297645 (https://phabricator.wikimedia.org/T126629)
[20:23:47] !log Disable Puppet on restbase1009.eqiad.wmnet : T126629
[20:23:48] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629
[20:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:25:44] (CR) Hashar: "And reading the doc I found out:" [puppet] - https://gerrit.wikimedia.org/r/296713 (https://phabricator.wikimedia.org/T119370) (owner: Hashar)
[20:30:39] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Patch-For-Review: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2434778 (Cyberpower678) Cool with it being merged, when will it be deployed?
[20:33:11] (CR) Hashar: "I have restarted mysql on both instance. ps shows a new process:" [puppet] - https://gerrit.wikimedia.org/r/296713 (https://phabricator.wikimedia.org/T119370) (owner: Hashar)
[20:33:42] (CR) Ottomata: Seeting up log retention policy in yarn (1 comment) [puppet/cdh] - https://gerrit.wikimedia.org/r/297643 (https://phabricator.wikimedia.org/T139178) (owner: Nuria)
[20:33:57] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Patch-For-Review: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2385660 (Reedy) >>! In T137973#2434778, @Cyberpower678 wrote: > Cool with it being merged, when will it be deployed? It'll be in .10 unless...
[20:34:03] (CR) Ottomata: [C: 2] Upgrade restbase1009 to Cassandra 2.2.6 [puppet] - https://gerrit.wikimedia.org/r/297645 (https://phabricator.wikimedia.org/T126629) (owner: Eevans)
[20:34:28] thcipriani, that diff fatal really grew
[20:34:33] anomie: https://phabricator.wikimedia.org/T137973#2434778 ?
[20:34:49] Do you plan to backport that change?
[20:34:57] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Patch-For-Review: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2434797 (Cyberpower678) >>! In T137973#2434790, @Reedy wrote: >>>! In T137973#2434778, @Cyberpower678 wrote: >> Cool with it being merged, wh...
[20:35:01] * thcipriani looks
[20:36:19] MaxSem: yup, since wmf.9 moved over to group1 it's definitely a sizable error now.
[20:36:20] Luke081515: I wasn't planning on doing it. Someone else might.
[20:36:34] hm, ok
[20:36:39] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Patch-For-Review: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2434800 (Reedy) Should be everywhere by the 14th.
Starting to be deployed on the 12th
[20:37:21] and due to HHVM's wonderful error handling we can't even tell where is the broken caller
[20:37:58] !log Restarting Cassandra instance restbase1008-a.eqiad.wmnet : T126629
[20:37:59] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629
[20:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:38:46] !log starting mobileapps deploy
[20:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:41:40] !log deployed mobileapps 7a73789
[20:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:44:07] !log Restarting Cassandra instance restbase1008-b.eqiad.wmnet : T126629
[20:44:08] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629
[20:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:44:50] Luke081515: legoktm is going to backport it, he noticed that we need to deploy it everywhere at once or else the renames will break.
[20:45:14] because they weren't already broken ;)
[20:45:45] anomie, legoktm, heh, actually I think they won'T break, because there are no renames: https://meta.wikimedia.org/wiki/Special:Log/gblrename
[20:45:48] :D
[20:46:30] I mean if someone did a rename when it was half-deployed, the rename would fail to complete.
[20:46:36] ok
[20:46:53] and I think the global renamers would like a backport more, so they have more time to cleanup the backlog ;)
[20:54:00] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Patch-For-Review: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2434857 (Cyberpower678) >>! In T137973#2434800, @Reedy wrote: > Should be everywhere by the 14th.
> > Starting to be deployed on the 12th (Y)
[21:00:41] (PS1) Kaldari: Test PageAssessments extension on test.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/297687 (https://phabricator.wikimedia.org/T137918)
[21:01:13] (PS2) Kaldari: Test PageAssessments extension on test.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/297687 (https://phabricator.wikimedia.org/T137918)
[21:04:14] Puppet, Continuous-Integration-Infrastructure, Zuul: role::zuul::configuration should be replaced by hiera - https://phabricator.wikimedia.org/T139527#2434888 (hashar)
[21:04:26] (Abandoned) BryanDavis: Add authmanager events to logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/271028 (owner: CSteipp)
[21:05:56] Puppet, Continuous-Integration-Infrastructure, Zuul, Technical-Debt: role::zuul::configuration should be replaced by hiera - https://phabricator.wikimedia.org/T139527#2434874 (hashar) role::zuul::configuration is really just a hash of hash of settings that are used by the role class to invoke th...
[21:07:27] ACKNOWLEDGEMENT - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Puppet has 1 failures Gehel Issue with augeas, investigating, related to https://gerrit.wikimedia.org/r/#/c/296551/
[21:10:11] Puppet, Continuous-Integration-Infrastructure, Zuul, Technical-Debt: role::zuul::configuration should be replaced by hiera - https://phabricator.wikimedia.org/T139527#2434900 (hashar) Maybe something like: ``` lang=yaml role::zuul::configuration::shared: labs: gerrit_server: fooobar.wmflab...
[21:11:28] Puppet, Continuous-Integration-Infrastructure, Zuul, Technical-Debt: role::zuul::configuration should be replaced by hiera - https://phabricator.wikimedia.org/T139527#2434901 (Paladox) I did this "zuul::server::gerrit_server": 127.0.0.1 "zuul::server::gerrit_user": jenkins "zuul::server::gearma...
[21:23:08] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 3 failures
[21:33:49] (PS3) Odder: Update logo settings for the Nepali Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/297134 (https://phabricator.wikimedia.org/T139240)
[21:35:57] (CR) Odder: "Had to update the spelling of the subline yet again." [mediawiki-config] - https://gerrit.wikimedia.org/r/297134 (https://phabricator.wikimedia.org/T139240) (owner: Odder)
[21:47:14] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[21:50:45] (PS4) DCausse: Setup CirrusSearch continuous saneitization process to run via cron [puppet] - https://gerrit.wikimedia.org/r/297276 (https://phabricator.wikimedia.org/T139200)
[21:55:00] (PS1) Gehel: Correct scoping issues in role::osm::master [puppet] - https://gerrit.wikimedia.org/r/297703
[21:57:00] Operations, Mobile-Content-Service, ORES, Revision-Scoring-As-A-Service, and 2 others: Investigate increased memory pressure on scb1001/2 - https://phabricator.wikimedia.org/T139177#2435178 (Ladsgroup) Open>Resolved
[21:58:26] (CR) DCausse: Setup CirrusSearch continuous saneitization process to run via cron (1 comment) [puppet] - https://gerrit.wikimedia.org/r/297276 (https://phabricator.wikimedia.org/T139200) (owner: DCausse)
[22:18:24] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[22:18:28] !log maxsem@tin Synchronized php-1.28.0-wmf.9/includes/diff: T139526 (duration: 00m 38s)
[22:18:29] T139526: Catchable fatal error: Argument 1 passed to DifferenceEngine::generateContentDiffBody() must implement interface Content, null given - https://phabricator.wikimedia.org/T139526
[22:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:19:02] Operations, ops-codfw, hardware-requests: procure syslog hardware in codfw - https://phabricator.wikimedia.org/T138075#2435303 (RobH)
[22:31:08] (CR) Chad: [C: -1] Gerrit: Setup rsync between old and new machines (2 comments) [puppet] - https://gerrit.wikimedia.org/r/296957 (https://phabricator.wikimedia.org/T125018) (owner: Chad)
[22:35:51] (PS3) Dzahn: admin: replace ssh key for andyrussg [puppet] - https://gerrit.wikimedia.org/r/297472 (https://phabricator.wikimedia.org/T139213)
[22:37:35] (PS4) Dzahn: admin: replace ssh key for andyrussg [puppet] - https://gerrit.wikimedia.org/r/297472 (https://phabricator.wikimedia.org/T139213)
[22:37:37] (CR) jenkins-bot: [V: -1] admin: replace ssh key for andyrussg [puppet] - https://gerrit.wikimedia.org/r/297472 (https://phabricator.wikimedia.org/T139213) (owner: Dzahn)
[22:39:26] (CR) Dzahn: [C: 2] "confirmed via edit on office wiki (and phab user, and IRC user)" [puppet] - https://gerrit.wikimedia.org/r/297472 (https://phabricator.wikimedia.org/T139213) (owner: Dzahn)
[22:40:07] (jenkins-bot said +2 a minute later but for some reason that was not reported by the bot here)
[22:42:39] Operations, Ops-Access-Requests, Patch-For-Review: New SSH key for AGreen - https://phabricator.wikimedia.org/T139213#2435463 (Dzahn) a:Dzahn Confirmed via office wiki edit (and phab user and irc user) https://office.wikimedia.org/w/index.php?title=User%3AAGreen_%28WMF%29&type=revision&diff=1913...
[22:45:53] (CR) EBernhardson: [C: 1] Setup CirrusSearch continuous saneitization process to run via cron [puppet] - https://gerrit.wikimedia.org/r/297276 (https://phabricator.wikimedia.org/T139200) (owner: DCausse)
[22:48:40] (CR) Dzahn: Gerrit: Setup rsync between old and new machines (3 comments) [puppet] - https://gerrit.wikimedia.org/r/296957 (https://phabricator.wikimedia.org/T125018) (owner: Chad)
[22:48:49] (PS9) Dzahn: Gerrit: Setup rsync between old and new machines [puppet] - https://gerrit.wikimedia.org/r/296957 (https://phabricator.wikimedia.org/T125018) (owner: Chad)
[22:51:05] (PS1) Ladsgroup: Enable ORES review tool as a beta feature in ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/297709 (https://phabricator.wikimedia.org/T139541)
[22:51:07] (CR) Chad: [C: 1] Gerrit: Setup rsync between old and new machines [puppet] - https://gerrit.wikimedia.org/r/296957 (https://phabricator.wikimedia.org/T125018) (owner: Chad)
[22:51:13] (PS10) Dzahn: Gerrit: Setup rsync between old and new machines [puppet] - https://gerrit.wikimedia.org/r/296957 (https://phabricator.wikimedia.org/T125018) (owner: Chad)
[22:52:06] (CR) Chad: [C: 1] Gerrit: Setup rsync between old and new machines [puppet] - https://gerrit.wikimedia.org/r/296957 (https://phabricator.wikimedia.org/T125018) (owner: Chad)
[22:52:48] Operations, Ops-Access-Requests, Patch-For-Review: New SSH key for AGreen - https://phabricator.wikimedia.org/T139213#2435486 (Dzahn) Open>Resolved @AndyRussG it should work, please just reopen and ping me if there are any unexpected problems
[22:53:54] (CR) Dzahn: [C: 2] Gerrit: Setup rsync between old and new machines [puppet] - https://gerrit.wikimedia.org/r/296957 (https://phabricator.wikimedia.org/T125018) (owner: Chad)
[23:00:04] RoanKattouw, ostriches, Krenair, MaxSem, and Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches)
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160706T2300). Please do the needful.
[23:00:04] James_F, kaldari, matt_flaschen, legoktm, and MatmaRex: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:21] hi.
[23:00:34] I did too
[23:00:37] hi
[23:00:48] is jouncebot a minute early, or is my clock a minute late? :o
[23:00:49] I don't know why I'm not pinged :D
[23:00:59] MatmaRex_: your clock is late ;)
[23:01:28] Present
[23:01:34] and if James_F is around to approve something I was going to add something to SWAT, but it would seem now!
[23:01:37] *not
[23:02:25] Operations, Commons, media-storage: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2435518 (kaldari)
[23:02:37] Operations, Commons, media-storage: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2435530 (kaldari) p:Triage>High
[23:03:46] Operations, Commons, media-storage: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2435518 (kaldari)
[23:05:42] Operations, Commons, media-storage, User-notice: Update rsvg on the image scalers to 2.40.16 (to solve several SVG rendering issues) - https://phabricator.wikimedia.org/T112421#2435534 (kaldari) Font anti-aliasing seems to be partially broken since the upgrade: T139543.
[23:11:48] so, is anyone deploying?
[23:12:13] let's go through the list…
[23:12:16] roan is on vacation
[23:12:18] ostriches is here.
[23:12:32] krenair is probably asleep
[23:12:35] MaxSem is also here.
[23:12:43] I'm actually not
[23:12:50] But I'm not supposed to be on the swat list anymore
[23:13:03] so am I :P
[23:13:09] well
[23:13:11] * ostriches is definitely not here
[23:13:11] yet I still deploy
[23:13:22] dereckson is also probably asleep
[23:13:26] that leaves ostriches and MaxSem
[23:15:24] fineee
[23:15:26] I'll deploy
[23:15:48] :O
[23:15:59] James_F: here?
[23:16:04] kaldari: here?
[23:16:12] yep
[23:16:21] prolly wanna do mine last since it needs a scap
[23:16:56] when was the extension added to extension-list?
[23:17:21] matt_flaschen: doing yours now
[23:18:31] legoktm: Sorry, just landed.
[23:18:52] Plane took longer than expected.
[23:19:07] ooooh
[23:20:06] legoktm: a week ago
[23:20:19] kaldari: so then you shouldn't need a scap
[23:20:28] ah good, I wasn't sure
[23:21:01] (CR) Legoktm: [C: -1] "This doesn't need the $wg = $wmg hack, you can just set 'wg....' in InitialiseSettings.php and it'll work properly" [mediawiki-config] - https://gerrit.wikimedia.org/r/296572 (https://phabricator.wikimedia.org/T133725) (owner: Jforrester)
[23:21:35] Eh.
[23:22:06] James_F: do you want to fix that now or should I just deploy it with a promise that you'll clean it up afterwards? :)
[23:22:12] (PS1) Chad: Gerrit: A few minor tweaks to rsync replication [puppet] - https://gerrit.wikimedia.org/r/297711 (https://phabricator.wikimedia.org/T125018)
[23:22:22] legoktm: The latter please. :-)
[23:22:25] (CR) Legoktm: [C: 2] Test PageAssessments extension on test.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/297687 (https://phabricator.wikimedia.org/T137918) (owner: Kaldari)
[23:22:51] James_F: I'm looking for your approval on https://gerrit.wikimedia.org/r/#/c/296753/ too if you have time!
[23:23:11] (Merged) jenkins-bot: Test PageAssessments extension on test.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/297687 (https://phabricator.wikimedia.org/T137918) (owner: Kaldari)
[23:24:47] addshore: I don't right now, sorry.
[23:24:55] (PS2) Legoktm: VisualEditor: Move the citation button out of the primary toolbar on Wikivoyages [mediawiki-config] - https://gerrit.wikimedia.org/r/296572 (https://phabricator.wikimedia.org/T133725) (owner: Jforrester)
[23:25:01] okay! I'll go to bed then ;)
[23:25:05] !log legoktm@tin Synchronized wmf-config: Test PageAssessments extension on test.wikipedia.org - T137918 (duration: 00m 36s)
[23:25:06] T137918: Test PageAssessements on test.wikipedia - https://phabricator.wikimedia.org/T137918
[23:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:25:14] testing
[23:25:14] (CR) Legoktm: [C: 2] "James promises he'll clean this up later :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/296572 (https://phabricator.wikimedia.org/T133725) (owner: Jforrester)
[23:25:51] it would be great if you could take a look and maybe give it a +1 in the next 12/24 hours! (so I can try and swat it tommorrow)
[23:25:52] (Merged) jenkins-bot: VisualEditor: Move the citation button out of the primary toolbar on Wikivoyages [mediawiki-config] - https://gerrit.wikimedia.org/r/296572 (https://phabricator.wikimedia.org/T133725) (owner: Jforrester)
[23:26:24] legoktm, Undefined variable: wmgUsePageAssessments in /srv/mediawiki/wmf-config/CommonSettings.php on line 3054
[23:27:23] MaxSem: bleh, it's because I sync'd the entire dir at once
[23:27:27] going down now
[23:27:32] Oops.
[23:27:46] We need a cleaner deployment system.
[23:28:04] or always sync to a canary server first and verify it there first.
[23:28:14] I see it only going up
[23:28:26] It's not very difficult.
And the chrome extension makes it easy for anyone else to help with the verification (e.g. for SWAT)
[23:28:45] fatalmonitor shows it dropping...
[23:29:03] no now its going up wtf
[23:29:45] still flooding tail of the log
[23:30:16] !log legoktm@tin Synchronized wmf-config: touch (duration: 00m 32s)
[23:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:30:30] better?
[23:31:11] yeah
[23:31:12] ok
[23:31:13] next
[23:32:37] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: VisualEditor: Move the citation button out of the primary toolbar on Wikivoyes - T133725 (1/2) (duration: 00m 26s)
[23:32:38] T133725: At Wikivoyage, move the Cite button into the Insert menu - https://phabricator.wikimedia.org/T133725
[23:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:33:54] !log legoktm@tin Synchronized wmf-config: VisualEditor: Move the citation button out of the primary toolbar on Wikivoyes - T133725 (2/2) (duration: 00m 30s)
[23:33:55] T133725: At Wikivoyage, move the Cite button into the Insert menu - https://phabricator.wikimedia.org/T133725
[23:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:34:06] legoktm: In order for PageAssessments to actually be usable, it also needs a couple tables created on prod. Should I get jcrespo to do the table creation, or is that something you can do?
[23:34:38] kaldari: first, you need to talk to him about a review on data leak
[23:34:53] then you can just add it to SWAT
[23:35:01] e.g. https://gerrit.wikimedia.org/r/#/c/297709/
[23:35:17] I'm saying since we went through the same before :)
[23:35:20] kaldari, jcrespo needs to do it. See https://wikitech.wikimedia.org/wiki/Schema_changes#Workflow_of_a_schema_change
[23:35:25] kaldari: has he already reviewed the tables?
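(Editorial sketch, not part of the channel log.) The "Undefined variable: wmgUsePageAssessments" notices at 23:26 were attributed to syncing the whole wmf-config directory in one step, and the next deploy was done as two syncs: wmf-config/InitialiseSettings.php first (1/2), then the rest of wmf-config (2/2). The Python below is a minimal illustration of that ordering only. It assumes InitialiseSettings.php is the file that defines the setting read by CommonSettings.php; `sync_plan` and `DEFINES_SETTINGS` are hypothetical names and are not part of scap or any other Wikimedia tooling.

```python
# Hypothetical helper for illustration; not part of scap or Wikimedia tooling.
# Assumption: this file defines settings that other config files read.
DEFINES_SETTINGS = "wmf-config/InitialiseSettings.php"


def sync_plan(changed_files):
    """Group changed files into ordered sync steps.

    The file that defines settings goes out alone in the first step, so
    files that read those settings never reach a server before it does.
    """
    changed = sorted(set(changed_files))
    steps = []
    if DEFINES_SETTINGS in changed:
        steps.append([DEFINES_SETTINGS])
    rest = [name for name in changed if name != DEFINES_SETTINGS]
    if rest:
        steps.append(rest)
    return steps


if __name__ == "__main__":
    plan = sync_plan([
        "wmf-config/CommonSettings.php",
        "wmf-config/InitialiseSettings.php",
    ])
    for number, files in enumerate(plan, start=1):
        print(f"step {number}/{len(plan)}: sync {', '.join(files)}")
```

The sketch only decides the order of steps; running each sync and checking the error log between steps is left to the deployer, as in the log above.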
[23:35:35] legoktm: yes
[23:35:36] James_F: btw yours is deployed
[23:35:45] kaldari: https://phabricator.wikimedia.org/T137567
[23:36:08] kaldari: and all the data is public?
[23:36:12] kaldari: submit a patch for https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/master/createExtensionTables.php please
[23:36:25] matt_flaschen: https://wikitech.wikimedia.org/wiki/Schema_changes#What_is_not_a_schema_change says table creation isn't a schema change
[23:36:36] ok, Echo time
[23:36:43] legoktm: Thanks.
[23:36:45] yup, exactly
[23:37:18] Oh, really? I didn't know that, sorry.
[23:37:50] legoktm: cool, I'll take care of that. It's fine without the tables in the meantime (since you have to explicitly add a new parser function to a page in order to use them).
[23:38:23] no it's not fine to create SQL fatals. just create the tables manually
[23:38:27] ^
[23:38:42] I'll do it after the patches, but please also submit a patch for the script
[23:39:11] there won't be any SQL fatals unless I try to use the new parser function on test.wiki.
[23:39:20] which I won't do for now :)
[23:39:22] or anyone else
[23:39:31] which you can't control
[23:39:55] true
[23:41:24] !log legoktm@tin Synchronized php-1.28.0-wmf.9/extensions/Echo/: T139321, T139323 (duration: 00m 32s)
[23:41:25] T139321: Bundled notifications: New lines preserved in text excerpts - https://phabricator.wikimedia.org/T139321
[23:41:25] T139323: Bundled notifications: The counter for bundled notifications shows sum of notifications of Alerts and Messages.
- https://phabricator.wikimedia.org/T139323
[23:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:41:30] matt_flaschen: ^
[23:42:05] (CR) Dzahn: [C: 2] Gerrit: A few minor tweaks to rsync replication [puppet] - https://gerrit.wikimedia.org/r/297711 (https://phabricator.wikimedia.org/T125018) (owner: Chad)
[23:43:05] (PS2) Legoktm: Enable ORES review tool as a beta feature in ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/297709 (https://phabricator.wikimedia.org/T139541) (owner: Ladsgroup)
[23:43:18] legoktm: does that repo exist in gerrit or just GitHub?
[23:43:26] it's a gerrit repo
[23:43:42] Operations, hardware-requests: esams: (3?) SAS 2TB disks for ms-be* systems - https://phabricator.wikimedia.org/T138618#2435664 (RobH) Except this is really a #procurement request (we dont generate discussion over spare shelf stuff in public space, and there is nothing but the pricing and approval.) I'l...
[23:44:11] Amir1: do I deploy the patch first or create the SQL tables first?
[23:44:24] legoktm: doesn't really matter
[23:44:25] found it
[23:44:30] https://gerrit.wikimedia.org/r/297713
[23:44:39] legoktm: I already made one for creating tables there
[23:44:43] :D
[23:45:10] for when you are done with this
[23:45:12] thanks :)
[23:45:30] thanks
[23:46:09] legoktm, tested, looks good.
[23:47:17] !log created ores_* tables on ruwiki
[23:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:47:32] (CR) Legoktm: [C: 2] Enable ORES review tool as a beta feature in ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/297709 (https://phabricator.wikimedia.org/T139541) (owner: Ladsgroup)
[23:48:07] (Merged) jenkins-bot: Enable ORES review tool as a beta feature in ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/297709 (https://phabricator.wikimedia.org/T139541) (owner: Ladsgroup)
[23:49:19] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Enable ORES review tool as a beta feature in ruwiki - T139541 (duration: 00m 27s)
[23:49:20] T139541: Deploy ORES review tool in Russian Wikipedia - https://phabricator.wikimedia.org/T139541
[23:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:50:09] !log running extensions/ORES/maintenance/CheckModelVersions.php and extensions/ORES/maintenance/PopulateDatabase.php on ruwiki
[23:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:50:33] Amir1: populate script finished
[23:50:42] ncie
[23:50:43] *nice
[23:50:47] I'm testing
[23:52:28] !log legoktm@tin Synchronized php-1.28.0-wmf.9/includes/specials/SpecialContributions.php: Add mediawiki.special.changeslist to SpecialContributions - T139522 (duration: 00m 25s)
[23:52:29] T139522: Change size indicators (e.g. (+25)) on Special:Contributions are no longer colored - https://phabricator.wikimedia.org/T139522
[23:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:52:42] legoktm: everything looks super slow to me, which might be because of my connection
[23:52:52] MatmaRex_: deployed
[23:52:53] Can you run a test?
[23:53:00] https://ru.wikipedia.org/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%9D%D0%B0%D1%81%D1%82%D1%80%D0%BE%D0%B9%D0%BA%D0%B8#mw-prefsection-betafeatures
[23:53:14] (enable it here and then go to recent changes)
[23:53:34] Amir1: gerrit seems very slow. Been waiting for my git review to finish for a couple minutes
[23:53:40] Amir1: seems fine to me
[23:53:52] https://ru.wikipedia.org/w/index.php?title=%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%A1%D0%B2%D0%B5%D0%B6%D0%B8%D0%B5_%D0%BF%D1%80%D0%B0%D0%B2%D0%BA%D0%B8&hidenondamaging=1 324ms
[23:54:02] legoktm: cooool
[23:54:04] legoktm: yay, https://www.mediawiki.org/wiki/Special:Contributions/Matma_Rex has colors again. thanks
[23:54:10] (CR) Dzahn: "sorry, don't know. leaving that to Mukunda and Chad" [puppet] - https://gerrit.wikimedia.org/r/295011 (owner: Paladox)
[23:54:10] np
[23:54:35] kaldari: I remember Europe had connection issues with gerrit, one week ago
[23:54:40] thanks
[23:54:45] Amir1: also the second script has a typo: revsisions (hilarious because it gets repeated a few hundred times)
[23:55:07] the populatedatabase?
[23:55:10] kaldari: I need to create both tables?
[23:55:11] Amir1: yes
[23:55:17] legoktm: yes
[23:55:22] :D
[23:55:27] let me get that fixed
[23:55:31] added it to the script here: https://gerrit.wikimedia.org/r/#/c/297714/
[23:56:15] !log created pageassesments tables on testwiki
[23:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:56:39] just me left then
[23:58:15] legoktm: one question: once "Make LocalRename jobs run sequentially" is deployed in wmf.8 and 9. Is it possible to do global rename again?
[23:58:27] someone told we should wait until 15th
[23:58:48] yes, but I need to test it actually works
[23:59:17] would you like to help with testing? :)
[23:59:49] yeah, I don't know much about SUL but I would love to help (I'm a global renamer if that helps)