[00:27:47] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:55:46] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [01:34:56] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 1817.086555 Seconds [01:34:56] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1817.833235 Seconds [01:35:56] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 20.460992 Seconds [01:35:56] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 21.254631 Seconds [01:43:46] PROBLEM - puppet last run on elastic1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:58:26] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:59:36] (03PS3) 10BBlack: nginx (1.11.4-1+wmf14) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/319776 [01:59:38] (03PS2) 10BBlack: add stapling-multi-file patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320115 [02:11:46] RECOVERY - puppet last run on elastic1017 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [02:16:45] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.1) (duration: 05m 39s) [02:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:02] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Nov 7 02:21:02 UTC 2016 (duration 4m 18s) [02:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:26:26] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [02:41:42] (03PS4) 10BBlack: nginx (1.11.4-1+wmf14) jessie-wikimedia; 
urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/319776 [02:41:44] (03PS1) 10BBlack: remove debian perl ldflags patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320161 [02:41:46] (03PS1) 10BBlack: depend on lsb-base >= 3.0-6 [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320162 [02:53:56] PROBLEM - puppet last run on relforge1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:56:56] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:22:56] RECOVERY - puppet last run on relforge1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [03:24:56] RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [03:27:26] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 696.88 seconds [03:32:26] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 299.73 seconds [03:59:46] PROBLEM - puppet last run on mc1022 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:00:46] RECOVERY - Last backup of the maps filesystem on labstore1001 is OK: OK - Last run for unit replicate-maps was successful [04:01:56] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 288 bytes in 0.010 second response time [04:04:46] PROBLEM - Last backup of the maps filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-maps was exit-code [04:26:46] RECOVERY - puppet last run on mc1022 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [04:38:43] [05:45] * fishkin (~fishkin@wikipedia/Gamliel-Fishkin) has joined #wikimedia-tech [04:38:43] [05:47] Hi, I want to report a bug. Is someone here to read my message? [04:38:43] [05:50] The MediaWiki default bot, which updates translated pages from the Translatewiki, does not update at least 3 pages: MediaWiki:Cite section preview references/eo, MediaWiki:Cite warning/eo, MediaWiki:Cite warning sectionpreview no text/eo. [04:39:16] scap-i18n update related i guess [04:53:39] arseny92: maybe, a task would be good (#scap and #i18n) [04:57:46] PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:21:46] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:21:52] greg-g T150155 [05:21:53] T150155: Localization updates do not update some eo messages - https://phabricator.wikimedia.org/T150155 [05:23:56] PROBLEM - puppet last run on ms-be1024 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:23:57] arseny92: ty [05:25:46] RECOVERY - puppet last run on chromium is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [05:48:46] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [05:51:56] RECOVERY - puppet last run on ms-be1024 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:04:36] PROBLEM - Juniper alarms on mr1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.199 [06:05:36] RECOVERY - Juniper alarms on mr1-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [06:29:36] PROBLEM - Disk space on logstash1003 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%) [06:30:26] PROBLEM - Disk space on logstash1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%) [06:31:16] PROBLEM - Disk space on logstash1002 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%) [06:40:46] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Not Available - 531 bytes in 0.026 second response time [06:51:46] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.056 second response time [06:59:50] <_joe_> ugh logstash [07:02:26] RECOVERY - Disk space on logstash1001 is OK: DISK OK [07:02:36] <_joe_> !log removing old logfiles on logstash hosts [07:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:16] PROBLEM - puppet last run on db1030 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:05:16] RECOVERY - Disk space on logstash1002 is OK: DISK OK [07:05:36] RECOVERY - Disk space on logstash1003 is OK: DISK OK [07:21:16] RECOVERY - puppet last run on db1030 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [07:34:37] (03PS2) 10Muehlenhoff: carbon_pickled: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/319878 [07:41:46] (03PS1) 10Marostegui: db-codfw.php: Depool db2042 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320166 (https://phabricator.wikimedia.org/T149553) [07:41:56] RECOVERY - Host labstore2001 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [07:52:54] (03CR) 10Muehlenhoff: [C: 032] Depend on new ABI name [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/319870 (owner: 10Muehlenhoff) [07:53:56] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.220 second response time [08:03:36] PROBLEM - MariaDB Slave SQL: x1 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [08:03:36] PROBLEM - MariaDB Slave SQL: s4 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [08:03:56] PROBLEM - MariaDB Slave IO: m2 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [08:03:56] PROBLEM - MariaDB Slave SQL: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [08:04:06] PROBLEM - MariaDB Slave SQL: m3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [08:04:06] PROBLEM - MariaDB Slave IO: m3 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [08:04:06] PROBLEM - MariaDB Slave SQL: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [08:04:06] PROBLEM - MariaDB Slave IO: s1 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [08:04:06] PROBLEM - MariaDB Slave IO: s7 on dbstore2001 is CRITICAL: 
CRITICAL slave_io_state could not connect [08:04:07] PROBLEM - MariaDB Slave SQL: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [08:04:16] PROBLEM - MariaDB Slave SQL: m2 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [08:04:16] PROBLEM - MariaDB Slave SQL: s1 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [08:04:16] PROBLEM - MariaDB Slave IO: s3 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [08:04:16] PROBLEM - MariaDB Slave IO: x1 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [08:04:16] PROBLEM - MariaDB Slave SQL: s2 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [08:04:17] PROBLEM - MariaDB Slave IO: s6 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [08:04:17] PROBLEM - MariaDB Slave IO: s2 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [08:04:18] PROBLEM - MariaDB Slave SQL: s3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [08:04:26] PROBLEM - MariaDB Slave IO: s4 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [08:04:26] PROBLEM - MariaDB Slave IO: s5 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [08:05:40] :( [08:05:45] Looks like it went out from downtime [08:06:38] <_joe_> ueaj [08:06:43] <_joe_> *yeah [08:06:49] I gave it a month of downtime while we test [08:08:04] _joe_: Heya… [08:08:51] Whatever ‘load balancing’ algorithm the video scalers use, it apparently sucks. [08:09:02] <_joe_> Revent: there is no load balancing there [08:09:08] <_joe_> it's a job processing system [08:09:42] Yeah, what I mean is…. sometimes it will crap all the tasks on one machine, and put it at like 75-80% load, while the other sits idle. 
[08:10:41] <_joe_> doesn't seem like that's the case tbh looking at the graphs from last week [08:11:01] <_joe_> but yeah, that can in theory happen under very particular circumstances [08:11:01] Re-transcoding these big broken files, even at 720p, sometimes fails even when the machines don’t go over 50% or so. [08:11:47] <_joe_> that's because transcoding is mostly a single-cpu job [08:11:54] <_joe_> and those machines have multiple cpus [08:12:10] (03PS1) 10Muehlenhoff: Bump changelog [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/320167 [08:12:52] Right, I know, it’s just that trying to fix these ‘big’ transcodes, it seems rather sensitive to even fairly low levels of load [08:13:00] (03CR) 10Muehlenhoff: [C: 032 V: 032] Bump changelog [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/320167 (owner: 10Muehlenhoff) [08:13:25] When it’s dumped a lot on one and not the other, it’s not really apparent in the longer graphs because the ‘spikes’ aren’t that long. [08:20:52] _joe_: Let’s put it this way… yeah, it’s a single cpu job, and there are ~30 cpus, that would imply you could transcode a ‘lot’ of files at once, but if you throw even 5x of these big transcodes at 720p on at once, they will all error out. I can’t diagnose it, ofc, but trying to get these fixed seems quite touchy. [08:21:20] <_joe_> we have a total of 8 cpus [08:21:25] <_joe_> which is not much in fact [08:21:35] <_joe_> did you open a bug by any chance? [08:21:46] Umm… lol, ganglia lies then. [08:21:57] <_joe_> sorry, 16 [08:22:08] <_joe_> the 32 cores you see there are from HT [08:22:13] <_joe_> (hyperthreading) [08:22:14] Ah. [08:22:37] <_joe_> but yeah, throwing in a couple more machines could be a good idea [08:22:51] <_joe_> actually, this is the classical case where an elastic environment would be optimal [08:23:31] TBH, it’s not the ‘big’ transcodes I care about, so much, but the thousands of little ones… they just aren’t in the short list that timedmediahandler makes visible. 
[08:24:19] in IS at ln14076 , 'wmgRSSUrlWhitelist' => ['mediawikiwiki' => has 'https://git.wikimedia.org/feed/mediawiki/extensions/Translate.git', . I guess that should be changed to the phab clone url as per T139089? The redirect rule goes to nowhere tbh but to a list of all repos instead [08:24:19] T139089: Fix references to git.wikimedia.org in all repos - https://phabricator.wikimedia.org/T139089 [08:24:59] But knowing there are ‘half’ the cores that it says rather makes the level of jobs where it starts erroring them out make a lot more sense. :P [08:25:43] <_joe_> Revent: the OS will see the number of cores reported by ganglia, but that's mostly an artifact [08:25:44] However that var wants a rss it seems and I can't seem to find repo rss [08:26:35] Yeah, makes sense… also makes sense that hyperthreading won’t help with transcoding much if at all. [08:28:32] _joe_: FYI, I’ve not opened a bug because, tbh, I can’t really describe the ‘problem’ well at all, other than the transcoding system is twitchy and fails a lot on large videos. :P [08:29:10] <_joe_> Revent: and, still very honestly, I would love to work on it but I have zero time for it [08:29:18] <_joe_> I might open a ticket about it, though [08:29:29] Yeah, I understand that there are other more urgent things. 
[08:30:10] <_joe_> it's more that we're very very thin on personnel :( [08:30:33] brion is already looking into it afaik, he needs to fix several issues first (namely the issue where JobRubber thinks a file is done when it's not, causing it to load another job which overloads the system) [08:30:35] !log Deploy schema change on s4 master (db2019) commonswiki.revision - T147305 [08:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:42] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305 [08:30:59] <_joe_> p858snake|L2: yes that's the main issue that needs solving on the dev side [08:32:07] p858snake|L2: A specific thing I have noticed is that there are broken transcodes in the ‘timedmediahandler’ list that are not shown as broken on the file page, but are… [08:32:26] have you opened a task about that issue? [08:33:08] I’ve mainly just been trying to ‘fix’ the transcodes. [08:33:49] But, sometimes in the middle of a list of transcodes that took hours, there will be one that supposedly was ‘successful’ in like 5 minutes. [08:33:58] (03CR) 10Jcrespo: [C: 031] "Ok on codfw only." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320166 (https://phabricator.wikimedia.org/T149553) (owner: 10Marostegui) [08:34:26] I guess I can open a ticket for (and leave broken) the next ones I notice. 
[08:36:33] p858snake|L2: https://commons.wikimedia.org/wiki/File:20160914_Meeting_of_the_Presidents_Export_Council_HD.webm <- the 480P OGG transcode [08:37:08] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2042 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320166 (https://phabricator.wikimedia.org/T149553) (owner: 10Marostegui) [08:37:40] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2042 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320166 (https://phabricator.wikimedia.org/T149553) (owner: 10Marostegui) [08:38:01] Revent: yes, leaving stuff broken so people can look into the causes is generally a good idea, our crystal balls aren't always the best to try to find causes after things have been repaired [08:38:54] (lol) [08:39:43] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2042 for maintenance - T149553 (duration: 00m 50s) [08:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:49] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553 [08:42:16] 06Operations, 10Wikimedia-General-or-Unknown: Extend capacity for video scalers - https://phabricator.wikimedia.org/T150067#2775570 (10MoritzMuehlenhoff) [08:43:22] 06Operations, 10Wikimedia-General-or-Unknown: Extend capacity for video scalers - https://phabricator.wikimedia.org/T150067#2773188 (10MoritzMuehlenhoff) Instead of repurposing the codfw scaler (we actually have only a single one) we should rather expand the capacity in codfw. Also, both video scalers in eqiad... [08:44:43] !log stopping mysql on db2042 - maintenance- T149553 [08:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:19] p858snake|L2: https://phabricator.wikimedia.org/T150158 <- probably not a great description. 
[08:46:28] !log uploaded linux-meta 1.11 to carbon (pointing to the new Linux ABI package) [08:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:15] (03PS2) 10Gilles: Set environment variables for ImageMagick running inside Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/319807 (https://phabricator.wikimedia.org/T149985) [09:10:18] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2775602 (10Gilles) [09:10:21] 06Operations, 06Performance-Team, 10Thumbor: Ask firejail upstream about ability to turn off pid namespacing - https://phabricator.wikimedia.org/T149981#2775600 (10Gilles) 05Open>03Resolved > You cannot turn off PID namespace, it is hardcoded deep inside the program. If you do "ps aux" outside of sandbox... [09:26:41] 06Operations, 10ops-eqiad, 10DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2760850 (10jcrespo) p:05Triage>03Normal [09:29:36] (03PS2) 10Ema: site: apply role::systemtap::devserver to copper [puppet] - 10https://gerrit.wikimedia.org/r/319616 [09:29:44] (03CR) 10Ema: [C: 032 V: 032] site: apply role::systemtap::devserver to copper [puppet] - 10https://gerrit.wikimedia.org/r/319616 (owner: 10Ema) [09:34:51] <_joe_> ema: please include that role inside role::builder [09:35:18] <_joe_> I can do it for you, but I am trying to reduce the number of places where hiera calls can end up [09:36:44] _joe_: sounds good, I'll add you to the reviewers :) [09:37:12] (03PS1) 10Gehel: elasticsearch - /etc/elasticsearch/scripts required for elasticsearch start up [puppet] - 10https://gerrit.wikimedia.org/r/320168 [09:45:30] (03PS1) 10Ema: Add role::systemtap::devserver to role::builder [puppet] - 10https://gerrit.wikimedia.org/r/320169 [10:03:59] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2775679 
(10Amire80) p:05Triage>03Normal [10:04:26] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2748980 (10Amire80) p:05Normal>03Triage [10:07:31] !log performing schema change on s5 (imagelinks) T139090 [10:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:38] T139090: Deploy I2b042685 to all databases - https://phabricator.wikimedia.org/T139090 [10:09:39] 06Operations: Remote IPMI doens't work for ~17% of the fleet - https://phabricator.wikimedia.org/T150160#2775695 (10Volans) [10:19:28] !log rebooting bast4001 for kernel update [10:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:47] (03PS2) 10Ema: Add role::systemtap::devserver to role::builder [puppet] - 10https://gerrit.wikimedia.org/r/320169 [10:26:52] !log rebooting cp1008 for kernel update [10:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:43] (03CR) 10Gehel: "puppet compiler: https://puppet-compiler.wmflabs.org/4552/" [puppet] - 10https://gerrit.wikimedia.org/r/320168 (owner: 10Gehel) [10:28:47] (03PS2) 10Gehel: elasticsearch - /etc/elasticsearch/scripts required for elasticsearch start up [puppet] - 10https://gerrit.wikimedia.org/r/320168 [10:30:30] (03PS1) 10Giuseppe Lavagetto: docker::registry: allow fetching images from the internet [puppet] - 10https://gerrit.wikimedia.org/r/320172 [10:31:17] (03CR) 10Giuseppe Lavagetto: [C: 031] Add role::systemtap::devserver to role::builder [puppet] - 10https://gerrit.wikimedia.org/r/320169 (owner: 10Ema) [10:31:42] <_joe_> ema: a sanity check on my change would be appreciated :) [10:32:42] (03CR) 10Ema: [C: 032 V: 032] Add role::systemtap::devserver to role::builder [puppet] - 10https://gerrit.wikimedia.org/r/320169 (owner: 10Ema) [10:33:52] _joe_: looking [10:35:14] (03PS2) 10Gehel: Maps - tilerator on all maps servers needs access to postgresql master [puppet] - 
10https://gerrit.wikimedia.org/r/319893 (https://phabricator.wikimedia.org/T147223) [10:39:22] <_joe_> is zuul down again? [10:39:51] (03PS2) 10Giuseppe Lavagetto: docker::registry: allow fetching images from the internet [puppet] - 10https://gerrit.wikimedia.org/r/320172 [10:39:53] (03CR) 10Ema: [C: 04-1] docker::registry: allow fetching images from the internet (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/320172 (owner: 10Giuseppe Lavagetto) [10:40:45] _joe_: a few comments, plus the change fails to compile (https://puppet-compiler.wmflabs.org/4557/) [10:40:56] <_joe_> yeah I know [10:41:06] <_joe_> I already fixed _that_ [10:41:35] <_joe_> not your other comments [10:41:42] k [10:44:15] heh also when setting be_opts we sometimes use integers and other times strings for the port numbers [10:44:35] (03PS3) 10Gehel: elasticsearch - /etc/elasticsearch/scripts required for elasticsearch start up [puppet] - 10https://gerrit.wikimedia.org/r/320168 [10:44:46] PROBLEM - puppet last run on relforge1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:45:46] RECOVERY - puppet last run on relforge1002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [10:46:20] (03CR) 10Gehel: [C: 032] elasticsearch - /etc/elasticsearch/scripts required for elasticsearch start up [puppet] - 10https://gerrit.wikimedia.org/r/320168 (owner: 10Gehel) [10:48:11] (03PS3) 10Giuseppe Lavagetto: docker::registry: allow fetching images from the internet [puppet] - 10https://gerrit.wikimedia.org/r/320172 [10:48:22] (03CR) 10Giuseppe Lavagetto: docker::registry: allow fetching images from the internet (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/320172 (owner: 10Giuseppe Lavagetto) [10:49:06] !log rebooting mw1017/mw1099 for kernel update [10:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:24] (03CR) 10Gehel: "puppet compiler seems to agree: https://puppet-compiler.wmflabs.org/4555/" [puppet] - 10https://gerrit.wikimedia.org/r/319893 (https://phabricator.wikimedia.org/T147223) (owner: 10Gehel) [10:53:51] (03PS4) 10Giuseppe Lavagetto: docker::registry: allow fetching images from the internet [puppet] - 10https://gerrit.wikimedia.org/r/320172 [10:54:46] PROBLEM - puppet last run on ms-be1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:02:04] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/301192 (owner: 10Hashar) [11:02:22] (03CR) 10Alexandros Kosiaris: [C: 04-1] Maps - tilerator on all maps servers needs access to postgresql master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/319893 (https://phabricator.wikimedia.org/T147223) (owner: 10Gehel) [11:07:27] 06Operations: update-ca-certificates, run via puppets sslcert module, doesn't update symlinks to replaced certificates - https://phabricator.wikimedia.org/T150058#2775860 (10akosiaris) @AlexMonk-WMF As @Joe said, we 've copied over the CA in production (2 months ago in fact). palladium has already been shutdown.... [11:23:46] RECOVERY - puppet last run on ms-be1003 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [11:26:38] 06Operations, 10Prod-Kubernetes, 10Traffic, 05Kubernetes-production-experiment: Make our docker registry public - https://phabricator.wikimedia.org/T150168#2775898 (10Joe) [11:27:05] 06Operations, 10Prod-Kubernetes, 10Traffic, 05Kubernetes-production-experiment: Make our docker registry public - https://phabricator.wikimedia.org/T150168#2775912 (10Joe) p:05Triage>03Normal [11:27:18] (03CR) 10Alexandros Kosiaris: [C: 031] "I am not in love with the place of the file (/etc/ferm) but I get why it's proposed to be there and I have no better proposal. So +1" [puppet] - 10https://gerrit.wikimedia.org/r/319071 (owner: 10Muehlenhoff) [11:27:23] 06Operations, 10Traffic, 13Patch-For-Review: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#2775914 (10Danielsberger) Ok, here are the new results for cache sizes between 50GB and 400GB. For now, I only looked at the Filter and Exp admission policies. Disclaimer: The... 
[11:27:26] (03PS5) 10Giuseppe Lavagetto: docker::registry: allow fetching images from the internet [puppet] - 10https://gerrit.wikimedia.org/r/320172 (https://phabricator.wikimedia.org/T150168) [11:30:41] (03CR) 10Ema: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/320172 (https://phabricator.wikimedia.org/T150168) (owner: 10Giuseppe Lavagetto) [11:33:41] !log rebooting cassandra test hosts (cerium, praseodymium, xenon) for kernel update [11:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:05] (03PS1) 10Giuseppe Lavagetto: Add entry for docker-registry.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/320177 (https://phabricator.wikimedia.org/T150168) [11:35:36] 06Operations, 10Traffic: reimage cp4016 and cp1055 - https://phabricator.wikimedia.org/T149843#2775933 (10ema) 05Open>03Resolved a:03ema The hosts have been reimaged on 2016-11-02. [11:37:46] (03CR) 10Alexandros Kosiaris: [C: 032] ores (labs): Define log directory in worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/319984 (https://phabricator.wikimedia.org/T149925) (owner: 10Ladsgroup) [11:37:53] (03PS2) 10Alexandros Kosiaris: ores (labs): Define log directory in worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/319984 (https://phabricator.wikimedia.org/T149925) (owner: 10Ladsgroup) [11:37:56] (03CR) 10Alexandros Kosiaris: [V: 032] ores (labs): Define log directory in worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/319984 (https://phabricator.wikimedia.org/T149925) (owner: 10Ladsgroup) [11:40:39] !log cp3043: repool varnish-be and varnish-be-rand (T149881) [11:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:46] T149881: varnish-be not restarting correctly because of disk space issues - https://phabricator.wikimedia.org/T149881 [11:40:56] <_joe_> I am waiting since 5 minutes for jenkins to catch up on a dns change [11:42:46] PROBLEM - puppet last run on wtp1015 is CRITICAL: CRITICAL: 
Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:48:03] (03PS1) 10Ema: cache_misc: use integers for port numbers [puppet] - 10https://gerrit.wikimedia.org/r/320179 [11:52:25] (03CR) 10Giuseppe Lavagetto: [C: 032] Add entry for docker-registry.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/320177 (https://phabricator.wikimedia.org/T150168) (owner: 10Giuseppe Lavagetto) [11:52:26] 06Operations, 06Analytics-Kanban, 10Traffic: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2775954 (10elukey) Quick summary: Brandon and Ema debugged the upload issue and figured out that it was related to the absence of a... [11:56:09] (03PS1) 10Ema: cache_text esams: route to codfw [puppet] - 10https://gerrit.wikimedia.org/r/320180 (https://phabricator.wikimedia.org/T131503) [11:57:17] (03PS1) 10Ema: cache_text: upgrade esams to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/320181 (https://phabricator.wikimedia.org/T131503) [11:57:19] (03PS6) 10Giuseppe Lavagetto: docker::registry: allow fetching images from the internet [puppet] - 10https://gerrit.wikimedia.org/r/320172 (https://phabricator.wikimedia.org/T150168) [11:59:22] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] docker::registry: allow fetching images from the internet [puppet] - 10https://gerrit.wikimedia.org/r/320172 (https://phabricator.wikimedia.org/T150168) (owner: 10Giuseppe Lavagetto) [12:00:33] !log rebooting wtp1001 for kernel update [12:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:10] 06Operations: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#2775967 (10Kelson) @Andrew @RobH @chasemp @AlexMonk-WMF Thank you for taking time to answer to this ticket. I'm now back on this topic after a long summer pause. Purpose: Create ZIM... 
[12:02:46] PROBLEM - puppet last run on darmstadtium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:03:05] (03PS1) 10Giuseppe Lavagetto: docker::registry: fix parser function call [puppet] - 10https://gerrit.wikimedia.org/r/320182 [12:03:06] <_joe_> that's me ^^ [12:05:33] (03CR) 10Giuseppe Lavagetto: [C: 032] docker::registry: fix parser function call [puppet] - 10https://gerrit.wikimedia.org/r/320182 (owner: 10Giuseppe Lavagetto) [12:06:46] RECOVERY - puppet last run on darmstadtium is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [12:09:19] !log performing schema change on s6 (imagelinks) T139090 [12:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:26] T139090: Deploy I2b042685 to all databases - https://phabricator.wikimedia.org/T139090 [12:10:36] 06Operations, 06Analytics-Kanban, 10Traffic: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2775972 (10elukey) From http://book.varnish-software.com/4.0/chapters/Tuning.html: ``` Varnish operates with multiple pools of thread... [12:11:46] RECOVERY - puppet last run on wtp1015 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [12:12:46] PROBLEM - puppet last run on darmstadtium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[nginx] [12:14:44] 06Operations: Remote IPMI doens't work for ~17% of the fleet - https://phabricator.wikimedia.org/T150160#2775982 (10MoritzMuehlenhoff) Updating the host provisioning docs in combination with a daily Icinga check sounds like the best approach to me. 
[12:19:17] 06Operations, 10OTRS: Intermittent 503 errors on OTRS ticket system when sending responses to tickets - https://phabricator.wikimedia.org/T148299#2775991 (10Josve05a) Hmm, seems to have been resolved (or maybe I'm not active enough on OTRS to catch it anymore), but ever since you'vve asked me to save the next... [12:27:48] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2775995 (10ArielGlenn) This needs: reading by experts, lots of cleanup, simplification probably, plus see todos at t... [12:28:35] (03PS1) 10Giuseppe Lavagetto: docker::registry: wrap nginx directives in a location [puppet] - 10https://gerrit.wikimedia.org/r/320184 (https://phabricator.wikimedia.org/T150168) [12:31:57] (03CR) 10Giuseppe Lavagetto: [C: 032] docker::registry: wrap nginx directives in a location [puppet] - 10https://gerrit.wikimedia.org/r/320184 (https://phabricator.wikimedia.org/T150168) (owner: 10Giuseppe Lavagetto) [12:38:19] (03PS1) 10Giuseppe Lavagetto: docker::registry: return 403 instead of 405 as an error [puppet] - 10https://gerrit.wikimedia.org/r/320185 (https://phabricator.wikimedia.org/T150168) [12:40:10] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2776049 (10ArielGlenn) @demon We had talked about the difference in memory usage for cobalt and elasticsearch, how t... 
[12:40:49] (03CR) 10Giuseppe Lavagetto: [C: 032] docker::registry: return 403 instead of 405 as an error [puppet] - 10https://gerrit.wikimedia.org/r/320185 (https://phabricator.wikimedia.org/T150168) (owner: 10Giuseppe Lavagetto) [12:42:46] RECOVERY - puppet last run on darmstadtium is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [12:49:59] jouncebot: neilpquinn [12:50:02] jouncebot: next [12:50:02] In 1 hour(s) and 9 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161107T1400) [12:50:04] sorry [12:54:58] Hi everybody. How is $wgLocaltimezone set for cswiki? It seems it isn't good... It should be CET/CEST. It depends if summer time is now. [12:56:04] 06Operations, 10Traffic, 13Patch-For-Review: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#2776071 (10Danielsberger) [[ https://github.com/dasebe/libvmod-cacheadmission/blob/master/vcl/ExpLRU.vcl | Here ]]'s some VCL/inline c that implements the exp-size amission poli... [12:56:08] 'cswiki' => 'Europe/Prague', // T73902 [12:56:09] T73902: Set timezone for cs.Wikipedia and cs.Wikinews - https://phabricator.wikimedia.org/T73902 [12:56:47] Btw, this is a question for #wikimedia-tech [12:57:06] Sorry. I'm in this channel, I'll ask for other related question there. [12:59:15] 06Operations, 10OTRS: Intermittent 503 errors on OTRS ticket system when sending responses to tickets - https://phabricator.wikimedia.org/T148299#2776074 (10akosiaris) 05Open>03Invalid @Josve05a Hm. Weird. Anyway, I am tentatively resolving as Invalid for now. Don't hesitate to reopen with a log if it ther... [13:06:12] (03PS5) 10Reedy: Add cronjob for regenerating captchas [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) [13:06:36] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:06:36] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:06:36] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:06:36] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:06:36] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:06:46] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:06:47] !log rebooting scandium for kernel update [13:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:56] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:06:56] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:06:56] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:06:56] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:06:56] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:06:56] (03CR) 10Reedy: "PS5 adds --oldcaptcha" [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [13:06:57] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:07:06] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:07:06] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:07:06] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:07:16] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:07:16] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:07:16] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:07:26] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:07:26] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:08:46] PROBLEM - puppet last run on mw1268 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:13:16] (03CR) 10Gehel: Maps - tilerator on all maps servers needs access to postgresql master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/319893 (https://phabricator.wikimedia.org/T147223) (owner: 10Gehel) [13:13:26] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [13:13:26] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [13:13:26] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [13:13:36] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [13:13:36] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [13:13:37] 06Operations, 10Mobile-Content-Service, 07Service-deployment-requests, 06Services (watching): New Service Request for Trending Edits Service - https://phabricator.wikimedia.org/T150043#2776093 (10Fjalapeno) [13:13:46] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [13:13:46] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [13:13:46] PROBLEM - MariaDB Slave Lag: s4 on 
dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [13:14:13] (03CR) 10BBlack: [C: 031] cache_misc: use integers for port numbers [puppet] - 10https://gerrit.wikimedia.org/r/320179 (owner: 10Ema) [13:14:16] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [13:14:16] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [13:14:56] jynus: ^ ? [13:17:46] 06Operations, 10Wikimedia-General-or-Unknown: Extend capacity for video scalers - https://phabricator.wikimedia.org/T150067#2776098 (10Reedy) Sounds like a good plan. I guess a newer CPU generation or two is going to provide some reasonable gains to begin with [13:17:56] marostegui: ? [13:19:28] !log shutting down Nodepool (labnodepool1001.eqiad.wmnet reboot) [13:19:30] bblack: not me, but let me check [13:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:47] 06Operations, 10Mobile-Content-Service, 07Service-deployment-requests, 06Services (watching): New Service Request for Trending Edits Service - https://phabricator.wikimedia.org/T150043#2776101 (10Fjalapeno) [13:22:06] PROBLEM - Disk space on graphite1002 is CRITICAL: DISK CRITICAL - free space: /boot 0 MB (0% inode=98%) [13:22:40] I do not know what it is [13:22:50] did it go down? [13:22:52] No [13:22:56] it is with too many connections [13:23:00] (03PS1) 10Urbanecm: Whitelisting domain for GWToolset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320190 (https://phabricator.wikimedia.org/T150167) [13:23:23] the schema change? [13:23:32] which schema change? 
[13:23:40] see sal [13:23:49] 12:09 < jynus> !log performing schema change on s6 (imagelinks) T139090 [13:23:50] T139090: Deploy I2b042685 to all databases - https://phabricator.wikimedia.org/T139090 [13:24:27] but that should only affect s6 [13:24:37] and it should have not production traffic [13:25:26] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [13:25:26] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [13:25:26] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:25:26] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [13:25:36] RECOVERY - MariaDB Slave Lag: s7 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 87911.68 seconds [13:25:36] RECOVERY - MariaDB Slave Lag: s6 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 87668.65 seconds [13:25:37] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:25:46] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:25:47] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [13:25:48] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [13:25:48] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [13:25:48] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:25:48] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [13:25:48] RECOVERY - MariaDB Slave Lag: m3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 1078.31 seconds [13:25:48] RECOVERY - MariaDB Slave Lag: s4 on dbstore1001 is OK: OK slave_sql_lag 
Replication lag: 87677.32 seconds [13:25:48] it does not have an extra port [13:25:49] RECOVERY - MariaDB Slave Lag: s5 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 87665.32 seconds [13:25:52] like production [13:25:56] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:25:56] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [13:25:57] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [13:25:59] we need to fix that [13:26:06] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [13:26:06] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:26:06] RECOVERY - Disk space on graphite1002 is OK: DISK OK [13:26:06] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:26:16] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [13:26:16] RECOVERY - MariaDB Slave Lag: s1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 80196.83 seconds [13:26:16] RECOVERY - MariaDB Slave Lag: s2 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 80409.84 seconds [13:26:16] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:26:25] !log rebooting labnodepool1001 for kernel update [13:26:26] RECOVERY - MariaDB Slave Lag: s3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 87589.95 seconds [13:26:26] RECOVERY - MariaDB Slave Lag: x1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 976.94 seconds [13:26:26] RECOVERY - MariaDB Slave Lag: m2 on dbstore1001 is OK: OK slave_sql_lag not a slave [13:26:27] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [13:26:27] Waiting for table metadata lock would explain it [13:26:29] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:42] yes, it is still there indeed [13:26:53] but not why it happened [13:26:59] I am checking the graphs and I do not see any spikes really (at least yet) [13:27:09] this database should have no traffic to justify a metadata lock pileup [13:27:20] unlike production [13:29:36] 06Operations, 10Mobile-Content-Service, 07Service-deployment-requests, 06Services (watching): New Service Request for Trending Edits Service - https://phabricator.wikimedia.org/T150043#2776128 (10Fjalapeno) [13:29:47] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2776129 (10ArielGlenn) >>! In T148478#2750773, @Dzahn wrote: > `root@cobalt:~# java -XX:+PrintFlagsFinal -version |... [13:32:06] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:32:22] maybe there is some dump process happening at the same time, I will check cron [13:34:01] !log Flushed nodepool instances. It is bringing up fresh one now. [13:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:30] nope, dumps are "0 1 * * 3" [13:35:33] Anyone here queried dbstore1001? 
It is not a problem, and it would give a reason for the issues- which right now are unknown [13:35:46] RECOVERY - puppet last run on mw1268 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [13:36:32] "alter table imagelinks DROP INDEX il_backlinks_namespace, ADD INDEX il_backlinks_namespace (il_from_namespace,il_to,il_from)" is running [13:36:33] !log reboot wdqs2* for kernel upgrade [13:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:47] but it requires other queries to create the metadata lock [13:38:22] (03CR) 10Faidon Liambotis: [C: 04-1] "I think I'd like to see this in the same commit that applies this (you mentioned a systemd unit?), as otherwise it's too much of a noop." [puppet] - 10https://gerrit.wikimedia.org/r/319071 (owner: 10Muehlenhoff) [13:39:44] (03PS3) 10Hashar: Enable wgAbuseFilterProfile at cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319566 (https://phabricator.wikimedia.org/T149899) (owner: 10Urbanecm) [13:39:46] (03PS5) 10Hashar: Enable Extension:ShortURL on bd.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311656 (https://phabricator.wikimedia.org/T146014) (owner: 10MarcoAurelio) [13:39:48] (03PS4) 10Hashar: Rename 'autopatrol' to 'autopatrolled' on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308446 (https://phabricator.wikimedia.org/T144699) (owner: 10MarcoAurelio) [13:39:50] (03PS3) 10Hashar: Allow local sysops to add accountcreator group in fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319805 (https://phabricator.wikimedia.org/T149986) (owner: 10Urbanecm) [13:39:52] (03PS3) 10Hashar: Allow reviewers to stabilize pages in Finnish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319808 (https://phabricator.wikimedia.org/T149987) (owner: 10Urbanecm) [13:39:54] (rebased patches for SWAT) [13:39:54] (03PS3) 10Hashar: Enable $wgAbuseFilterProfile at Meta-Wiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/319569 (https://phabricator.wikimedia.org/T149901) (owner: 10MarcoAurelio) [13:41:06] PROBLEM - puppet last run on es1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:41:19] https://grafana.wikimedia.org/dashboard/db/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=dbstore1001&from=1478523156994&to=1478525958122 [13:41:57] hashar: You'll have to rebase them again... :P [13:42:00] After you start merging them [13:42:28] jouncebot: next [13:42:28] In 0 hour(s) and 17 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161107T1400) [13:42:44] (03PS1) 10Thiemo Mättig (WMDE): Add missing $wgPropertySuggesterClassifyingPropertyIds for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320192 [13:42:50] (having a break/coffee before swat) [13:43:28] Reedy: not sure why ? once I get a full chain that is on the tip of the branch [13:43:35] I can just CR+2 each of them in whatever order [13:43:40] and they will eventually all land [13:43:44] brb [13:43:45] I thought it prevented merge commits on the repo? [13:44:06] RECOVERY - puppet last run on es1012 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [13:44:15] Reedy: well they are fast forward right now :] [13:44:36] (03CR) 10Faidon Liambotis: [C: 04-1] Check whether ferm has been correctly started (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) (owner: 10Muehlenhoff) [13:49:56] PROBLEM - Host db2034 is DOWN: PING CRITICAL - Packet loss = 100% [13:50:18] jynus, marostegui ^ ? [13:50:31] paravoid: me, came back from downtime [13:50:36] [15:43] Reedy: not sure why ? 
once I get a full chain that is on the tip of the branch ->> yes but one one is merged the branch becomes newer than the rest [13:50:45] it has hardware issues [13:50:47] :( [13:51:58] arseny92: that is handled [13:52:00] hashar: what's the plan with swat today? there is plenty of patches [13:52:01] !log depooling cp4018 nginx+varnish-fe services for debugging [13:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:30] zeljkof: going to deploy them all at once except the last config change that requires a script to be run [13:53:39] hashar: Ok, so you are doing swat today then? [13:54:26] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 858.69 seconds [13:54:56] RECOVERY - Host db2034 is UP: PING OK - Packet loss = 0%, RTA = 36.51 ms [13:55:13] zeljkof: yeah i will [13:56:53] !log reboot wdqs1* for kernel upgrade [13:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:19] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 858.69 seconds -> that is the alter table probably (it is running) [13:59:07] (03PS1) 10Urbanecm: Enable Extension:ShortUrl for tcywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320193 (https://phabricator.wikimedia.org/T150166) [13:59:34] hashar zeljkof Will it be possible to deploy nine patches? If it would I'll schedule it for another window. [13:59:45] Urbanecm: yes [13:59:51] 06Operations, 10Monitoring: Huge log files on icinga machines - https://phabricator.wikimedia.org/T150061#2776176 (10Volans) @fgiunchedi @akosiaris @Joe From the looks of it (I've take just a quick look, correct me if I'm wrong): - Puppet spam is caused by a missing sorting in the generated file from `nagge... [14:00:02] Okay. So I'm going to add it to the calendar. 
[14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161107T1400). [14:00:04] yurik, Urbanecm, and mafk: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:06] RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:00:27] yurik not around for https://gerrit.wikimedia.org/r/#/c/320160/ :/ [14:00:48] Added. [14:00:53] will try [14:01:52] 320190 there's no aligning whitespaces to show it as table [14:02:33] see all the other lines are aligned [14:02:46] Going to add. Sorry [14:03:02] here [14:03:37] (03CR) 10Faidon Liambotis: [C: 04-1] "Hardcoding our expiry thresholds for either GlobalSign or Let's Encrypt all over the tree isn't a great idea. It makes it much harder to a" [puppet] - 10https://gerrit.wikimedia.org/r/313805 (https://phabricator.wikimedia.org/T144293) (owner: 10Alex Monk) [14:03:57] yurik: good morning! I have pulled the Kartographer patch on mw1099 [14:04:04] yurik: but I have no idea how to verify it works fine [14:04:16] PROBLEM - puppet last run on ms-be1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:04:17] hashar, testing... [14:04:35] (03PS2) 10Urbanecm: Whitelisting domain for GWToolset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320190 (https://phabricator.wikimedia.org/T150167) [14:04:39] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 07Wikimedia-Incident: Labstore2001 controller or shelf failure - https://phabricator.wikimedia.org/T102626#2776187 (10Papaul) @madhuvishyi rebuild the RAID you should be able to see all disk now on H800. Let me know if you have any other questions.... 
[14:05:24] arseny92: Fixed. [14:05:25] k good [14:05:31] Thanks. [14:05:43] hashar, works [14:06:13] 06Operations, 10ops-eqiad, 10hardware-requests: Return wmf4747/wmf4748/wmf4749/wmf4750 to spares - https://phabricator.wikimedia.org/T146171#2776190 (10faidon) So it's unclear to me what the next step is and who should be acting on this now; is it @RobH or @Cmjohnson? [14:07:47] (03CR) 10Faidon Liambotis: [C: 031] "LGTM, but perhaps we should consider renaming the script then, as "reimage --new" is an oxymoron :)" [puppet] - 10https://gerrit.wikimedia.org/r/318304 (https://phabricator.wikimedia.org/T148816) (owner: 10Volans) [14:08:41] paravoid: --no-re (in the sense of removing the 're' from the name) :-P ^^^ [14:08:49] hah [14:09:33] (03PS2) 10Faidon Liambotis: mail: add an empty statement for 4.87+ compatibility [puppet] - 10https://gerrit.wikimedia.org/r/316956 [14:09:48] (03PS2) 10BBlack: remove debian perl ldflags patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320161 [14:09:50] (03PS2) 10BBlack: depend on lsb-base >= 3.0-6 [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320162 [14:09:52] (03PS5) 10BBlack: nginx (1.11.4-1+wmf14) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/319776 [14:09:54] (03PS3) 10BBlack: add stapling-multi-file patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320115 [14:10:12] yurik: syncing [14:10:35] (03CR) 10Faidon Liambotis: [C: 032] mail: add an empty statement for 4.87+ compatibility [puppet] - 10https://gerrit.wikimedia.org/r/316956 (owner: 10Faidon Liambotis) [14:10:49] (03CR) 10Faidon Liambotis: [V: 032] mail: add an empty statement for 4.87+ compatibility [puppet] - 10https://gerrit.wikimedia.org/r/316956 (owner: 10Faidon Liambotis) [14:10:56] !log hashar@tin Synchronized php-1.29.0-wmf.1/extensions/Kartographer/extension.json: Fix monobook (missing debounce dep) T145521 (duration: 00m 47s) [14:11:01] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:02] T145521: does not work in Monobook skin - https://phabricator.wikimedia.org/T145521 [14:13:14] reviewing other patches [14:14:04] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319566 (https://phabricator.wikimedia.org/T149899) (owner: 10Urbanecm) [14:14:21] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319805 (https://phabricator.wikimedia.org/T149986) (owner: 10Urbanecm) [14:14:34] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319808 (https://phabricator.wikimedia.org/T149987) (owner: 10Urbanecm) [14:14:39] (03Merged) 10jenkins-bot: Enable wgAbuseFilterProfile at cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319566 (https://phabricator.wikimedia.org/T149899) (owner: 10Urbanecm) [14:15:03] (03Merged) 10jenkins-bot: Allow local sysops to add accountcreator group in fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319805 (https://phabricator.wikimedia.org/T149986) (owner: 10Urbanecm) [14:15:12] (03Merged) 10jenkins-bot: Allow reviewers to stabilize pages in Finnish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319808 (https://phabricator.wikimedia.org/T149987) (owner: 10Urbanecm) [14:15:29] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320190 (https://phabricator.wikimedia.org/T150167) (owner: 10Urbanecm) [14:15:56] PROBLEM - puppet last run on ms-be1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:16:30] bah [14:16:56] (03PS3) 10Hashar: Whitelisting domain for GWToolset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320190 (https://phabricator.wikimedia.org/T150167) (owner: 10Urbanecm) [14:17:07] (03CR) 10Hashar: Whitelisting domain for GWToolset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320190 (https://phabricator.wikimedia.org/T150167) (owner: 10Urbanecm) [14:17:12] (03CR) 10Hashar: [C: 032] Whitelisting domain for GWToolset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320190 (https://phabricator.wikimedia.org/T150167) (owner: 10Urbanecm) [14:17:43] (03Merged) 10jenkins-bot: Whitelisting domain for GWToolset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320190 (https://phabricator.wikimedia.org/T150167) (owner: 10Urbanecm) [14:18:24] https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=958951&oldid=958914 [14:18:26] I got confused at some point [14:19:10] 311656 has dependson [14:19:26] RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 158.70 seconds [14:19:38] which is only noted on gerrit [14:19:52] (03PS2) 10Ema: cache_misc: use integers for port numbers [puppet] - 10https://gerrit.wikimedia.org/r/320179 [14:19:58] (03CR) 10Ema: [C: 032 V: 032] cache_misc: use integers for port numbers [puppet] - 10https://gerrit.wikimedia.org/r/320179 (owner: 10Ema) [14:20:15] hashar: At which point exactly? [14:20:30] why there's no consistency when one adds stuff to deploy and not note the tasks or depending commits [14:20:35] by a rename in a gerrit change [14:20:37] not a big deal [14:20:50] I have pushed on mw1099 the first four changes by Urbanecm [14:21:04] Okay. So I'm going to test them. I'll let you know. 
[14:21:08] hashar: ^ [14:21:24] hashar be sure to note the tasks in log when you sync ) [14:21:34] per ^ [14:21:44] https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=958951&oldid=958914 [14:23:46] hashar: 319805 can be deployed [14:24:00] (03CR) 10Hashar: [C: 04-1] "The extensions requires a database schema change. Eg loading schemas/shorturls.sql and really I have no idea how we are handling them." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320193 (https://phabricator.wikimedia.org/T150166) (owner: 10Urbanecm) [14:24:22] (03PS2) 10Giuseppe Lavagetto: RESTBase: Add baseUriTemplate parameter. [puppet] - 10https://gerrit.wikimedia.org/r/319897 (owner: 10Ppchelko) [14:24:56] hashar: Why C-1? My change is wrong someway? I already requested enabling new extension which almost always needs new tables... I'll try to find something after testing. [14:24:56] RECOVERY - MariaDB Slave SQL: s6 on dbstore2001 is OK: OK slave_sql_state not a slave [14:24:56] RECOVERY - MariaDB Slave IO: m2 on dbstore2001 is OK: OK slave_io_state not a slave [14:25:06] RECOVERY - MariaDB Slave IO: m3 on dbstore2001 is OK: OK slave_io_state not a slave [14:25:06] RECOVERY - MariaDB Slave SQL: s5 on dbstore2001 is OK: OK slave_sql_state not a slave [14:25:06] RECOVERY - MariaDB Slave SQL: s7 on dbstore2001 is OK: OK slave_sql_state not a slave [14:25:06] RECOVERY - MariaDB Slave SQL: m3 on dbstore2001 is OK: OK slave_sql_state not a slave [14:25:06] RECOVERY - MariaDB Slave IO: s7 on dbstore2001 is OK: OK slave_io_state not a slave [14:25:07] RECOVERY - MariaDB Slave IO: s1 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional) [14:25:13] (03CR) 10Hashar: [C: 04-1] "The extension requires a database schema change. Eg loading schemas/shorturls.sql and really I have no idea how we are handling them. 
Skip" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311656 (https://phabricator.wikimedia.org/T146014) (owner: 10MarcoAurelio) [14:25:16] RECOVERY - MariaDB Slave IO: s2 on dbstore2001 is OK: OK slave_io_state not a slave [14:25:16] RECOVERY - MariaDB Slave SQL: s3 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [14:25:16] RECOVERY - MariaDB Slave SQL: s2 on dbstore2001 is OK: OK slave_sql_state not a slave [14:25:16] RECOVERY - MariaDB Slave SQL: s1 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [14:25:16] RECOVERY - MariaDB Slave SQL: m2 on dbstore2001 is OK: OK slave_sql_state not a slave [14:25:17] RECOVERY - MariaDB Slave IO: x1 on dbstore2001 is OK: OK slave_io_state not a slave [14:25:17] RECOVERY - MariaDB Slave IO: s6 on dbstore2001 is OK: OK slave_io_state not a slave [14:25:18] RECOVERY - MariaDB Slave IO: s3 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional) [14:25:26] RECOVERY - MariaDB Slave IO: s4 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: No, (no error: intentional) [14:25:26] RECOVERY - MariaDB Slave IO: s5 on dbstore2001 is OK: OK slave_io_state not a slave [14:25:29] hashar: 319808 can be deployed too [14:25:36] RECOVERY - MariaDB Slave SQL: s4 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [14:25:36] RECOVERY - MariaDB Slave SQL: x1 on dbstore2001 is OK: OK slave_sql_state not a slave [14:25:43] (03PS4) 10Hashar: Enable $wgAbuseFilterProfile at Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319569 (https://phabricator.wikimedia.org/T149901) (owner: 10MarcoAurelio) [14:26:04] (03CR) 10Hashar: [C: 032] Enable $wgAbuseFilterProfile at Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319569 (https://phabricator.wikimedia.org/T149901) (owner: 10MarcoAurelio) [14:26:29] hashar , by using the wikimediamaintenance extension scripts stuff etc 
iirc [14:26:50] (03Merged) 10jenkins-bot: Enable $wgAbuseFilterProfile at Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319569 (https://phabricator.wikimedia.org/T149901) (owner: 10MarcoAurelio) [14:26:50] going to add [config] 319569 Enable $wgAbuseFilterProfile at Meta-Wiki [14:27:03] (03CR) 10Giuseppe Lavagetto: [C: 032] RESTBase: Add baseUriTemplate parameter. [puppet] - 10https://gerrit.wikimedia.org/r/319897 (owner: 10Ppchelko) [14:27:05] the couple changes that enable shortUrl, i got to do the schema change [14:27:10] so holding for now [14:28:24] hashar , by using the wikimediamaintenance extension scripts stuff etc iirc [14:28:38] to create tables for new extensions [14:28:39] Urbanecm: guess I will deploy all five changes in one go :} [14:28:53] hashar: I'm only reporting the progress :) [14:28:55] <_joe_> mobrovac: running puppet on the restbase hosts [14:29:20] 320190 is untestable at mw1099 because I have no rights to try it. [14:29:56] 319566 can be deployed. [14:30:01] Urbanecm: I am syncing them all [14:30:06] looks fine to me [14:30:15] hashar: All except 320190 was tested by me and 320190 is untestable for me. 
[14:30:50] !log hashar@tin Synchronized wmf-config: (no message) (duration: 00m 53s) [14:30:52] (03CR) 10Ottomata: [C: 031] confluent::kafka::mirror::jmxtrans: key attr is declared more than once [puppet] - 10https://gerrit.wikimedia.org/r/319770 (owner: 10Dzahn) [14:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:02] hashar be sure to note the tasks in log when you sync ) [14:31:06] uh [14:31:14] just assume the patches got deployed and close tasks [14:31:28] OK [config] 319566 Enable wgAbuseFilterProfile at cswiki [14:31:28] OK [config] 319805 Allow local sysops to add accountcreator group in fiwiki [14:31:30] OK [config] 319808 Allow reviewers to stabilize pages in Finnish Wikipedia [14:31:30] OK [config] 320190 CopyUploadsDomain addition [14:31:32] OK [config] 319569 Enable $wgAbuseFilterProfile at Meta-Wiki [14:32:07] Thanks hashar. I'll close them. About the database changes, maybe https://wikitech.wikimedia.org/wiki/How_to_do_a_schema_change can help? 
[14:32:09] (03PS1) 10Muehlenhoff: Load connection tracking sysctl values via a separate systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094) [14:32:16] RECOVERY - puppet last run on ms-be1013 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:32:40] (03CR) 10Hashar: "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308446 (https://phabricator.wikimedia.org/T144699) (owner: 10MarcoAurelio) [14:33:43] !log reboot maps-test* for kernel upgrade [14:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:53] (03CR) 10jenkins-bot: [V: 04-1] Load connection tracking sysctl values via a separate systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094) (owner: 10Muehlenhoff) [14:34:14] 06Operations, 13Patch-For-Review: Not all packages from packages::statistics are available on jessie - https://phabricator.wikimedia.org/T150003#2776282 (10Ottomata) The various myspell* packages were requested in T99030 and T121011. @halfak can comment as to whether they are still needed. zpubsub was create... [14:34:50] (03CR) 10Hashar: [C: 032] Rename 'autopatrol' to 'autopatrolled' on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308446 (https://phabricator.wikimedia.org/T144699) (owner: 10MarcoAurelio) [14:35:06] https://wikitech.wikimedia.org/wiki/Schema_changes#What_is_not_a_schema_change [14:35:30] hashar: Closed. 
[14:35:35] creating new tables is not schema change [14:35:39] (CR) Gehel: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/319875 (owner: Muehlenhoff) [14:36:19] wikimedia maintenance ext script is used iirc for new tables creation [14:37:22] (PS5) Hashar: Rename 'autopatrol' to 'autopatrolled' on fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/308446 (https://phabricator.wikimedia.org/T144699) (owner: MarcoAurelio) [14:37:41] (CR) Hashar: Rename 'autopatrol' to 'autopatrolled' on fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/308446 (https://phabricator.wikimedia.org/T144699) (owner: MarcoAurelio) [14:37:47] (CR) Hashar: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/308446 (https://phabricator.wikimedia.org/T144699) (owner: MarcoAurelio) [14:38:27] (Merged) jenkins-bot: Rename 'autopatrol' to 'autopatrolled' on fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/308446 (https://phabricator.wikimedia.org/T144699) (owner: MarcoAurelio) [14:39:02] (PS2) Muehlenhoff: Load connection tracking sysctl values via a separate systemd unit [puppet] - https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094) [14:40:12] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Rename 'autopatrol' to 'autopatrolled' on fawiki - T144699 T139246 (duration: 00m 47s) [14:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:19] T144699: Rename 'autopatrol' to 'autopatrolled' on fawiki - https://phabricator.wikimedia.org/T144699 [14:40:19] T139246: Migrate local group names to WikimediaMessages - https://phabricator.wikimedia.org/T139246 [14:41:24] log fawiki: renaming user group 'autopatrol' to 'autopatrolled' for T139246 and T144699 with: mwscript migrateUserGroup.php --wiki=fawiki 'autopatrol' 'autopatrolled' [14:42:34] !log fawiki Done! 417 users in group 'autopatrol' are now in 'autopatrolled' instead. 
[14:42:38] !log fawiki: renaming user group 'autopatrol' to 'autopatrolled' for T139246 and T144699 with: mwscript migrateUserGroup.php --wiki=fawiki 'autopatrol' 'autopatrolled' [14:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:44] (PS2) Volans: wmf-auto-reimage: add option --new for new hosts [puppet] - https://gerrit.wikimedia.org/r/318304 (https://phabricator.wikimedia.org/T148816) [14:43:56] RECOVERY - puppet last run on ms-be1006 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [14:44:26] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 214, down: 1, dormant: 0, excluded: 0, unused: 0, xe-3/3/2: down [14:44:52] (PS1) KartikMistry: Explicitly set cookieDomain for ContentTranslationSiteTemplates [mediawiki-config] - https://gerrit.wikimedia.org/r/320200 (https://phabricator.wikimedia.org/T149879) [14:45:26] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 234, down: 1, dormant: 0, excluded: 0, unused: 0, xe-3/3/2: down [14:46:36] doing the shortUrls changes now [14:46:48] (PS2) Hashar: Enable Extension:ShortUrl for tcywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/320193 (https://phabricator.wikimedia.org/T150166) (owner: Urbanecm) [14:46:50] (PS6) Hashar: Enable Extension:ShortURL on bd.wikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/311656 (https://phabricator.wikimedia.org/T146014) (owner: MarcoAurelio) [14:47:34] (CR) Hashar: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/320193 (https://phabricator.wikimedia.org/T150166) (owner: Urbanecm) [14:47:45] hashar: Okay. 
[14:48:21] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311656 (https://phabricator.wikimedia.org/T146014) (owner: 10MarcoAurelio) [14:48:36] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:48:53] (03Merged) 10jenkins-bot: Enable Extension:ShortURL on bd.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311656 (https://phabricator.wikimedia.org/T146014) (owner: 10MarcoAurelio) [14:49:00] (03Merged) 10jenkins-bot: Enable Extension:ShortUrl for tcywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320193 (https://phabricator.wikimedia.org/T150166) (owner: 10Urbanecm) [14:49:21] https://gerrit.wikimedia.org/r/320202 to add ShortUrl to createExtensionTables :P [14:49:25] !log terbium: scap pull to add shortUrl tables to bdwikimedia and tcywiki [14:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:40] eekkk [14:52:42] (03CR) 10Volans: [C: 032] wmf-auto-reimage: add option --new for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/318304 (https://phabricator.wikimedia.org/T148816) (owner: 10Volans) [14:52:48] (03PS1) 10ArielGlenn: add local cruft to .gitignore [dumps] - 10https://gerrit.wikimedia.org/r/320204 [14:53:57] 06Operations, 10Mobile-Content-Service, 07Service-deployment-requests, 06Services (watching): New Service Request for Trending Edits Service - https://phabricator.wikimedia.org/T150043#2776362 (10Fjalapeno) [14:54:39] 06Operations, 10Mobile-Content-Service, 07Service-deployment-requests, 06Services (watching): New Service Request for Trending Edits Service - https://phabricator.wikimedia.org/T150043#2772430 (10Fjalapeno) [14:54:56] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:55:25] (03CR) 10Muehlenhoff: "My systemd unit-based approach is at https://gerrit.wikimedia.org/r/#/c/320197/, but I'll look into the ferm-internal post hook. I didn't " [puppet] - 10https://gerrit.wikimedia.org/r/319071 (owner: 10Muehlenhoff) [14:56:08] Urbanecm you're in charge of your own workboard so I not touching it (T150166) [14:56:08] T150166: Create short URL for Tulu (tcy) Wikipedia - https://phabricator.wikimedia.org/T150166 [14:57:24] arseny92: I don't use any Done column in my workboard. All is ok :) [14:57:55] going to try to do the schema change :D [14:58:28] uh [14:59:17] hashar: mwscript sql.php --wiki=whatever php-1.29.0-wmf.1/extensions/ShortUrl/schemas/shorturl.sql [14:59:35] Or do you need to cat | [14:59:54] either works :) [15:00:57] !log Deactivate cr1-eqiad BGP peering with pfw1-eqiad [15:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:17] 06Operations, 10Mobile-Content-Service, 07Service-deployment-requests, 06Services (watching): New Service Request for Trending Edits Service - https://phabricator.wikimedia.org/T150043#2776390 (10Fjalapeno) [15:01:32] !log T150166 mwscript sql.php --wiki=tcywiki /srv/mediawiki/php-1.29.0-wmf.1/extensions/ShortUrl/schemas/shorturls.sql [15:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:37] T150166: Create short URL for Tulu (tcy) Wikipedia - https://phabricator.wikimedia.org/T150166 [15:01:55] !log T146014 mwscript sql.php --wiki=bdwikimedia /srv/mediawiki/php-1.29.0-wmf.1/extensions/ShortUrl/schemas/shorturls.sql [15:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:03] T146014: Enable Extension:ShortUrl on chapterwiki of WMBD - https://phabricator.wikimedia.org/T146014 [15:02:13] !log rebooting mw1261-mw1265 (canary app servers) for kernel update [15:02:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:37] (03CR) 10ArielGlenn: [C: 032] add local cruft to .gitignore [dumps] - 10https://gerrit.wikimedia.org/r/320204 (owner: 10ArielGlenn) [15:02:50] !log T150166 mwscript extensions/ShortUrl/populateShortUrlTable.php --wiki=tcywiki (1569 titles done) [15:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:03] !log T146014 mwscript extensions/ShortUrl/populateShortUrlTable.php --wiki=bdwikimedia (714 titles done) [15:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:04] ah finally [15:04:11] got the shortUrl enabled on mw1099 [15:04:56] 06Operations, 06Analytics-Kanban, 10hardware-requests: stat1001 replacement box in eqiad - https://phabricator.wikimedia.org/T149911#2776402 (10Ottomata) Sounds perfect, thank you. [15:05:12] (03PS1) 10Giuseppe Lavagetto: docker::registry: add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/320206 [15:05:18] (03CR) 10ArielGlenn: "Well after a ridiculously long time I am ready to make this happen. 
We need to think about where the defaults file should live (hardcoded" [dumps] - 10https://gerrit.wikimedia.org/r/43156 (owner: 10Awight) [15:05:44] (03CR) 10Gehel: Maps - tilerator on all maps servers needs access to postgresql master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/319893 (https://phabricator.wikimedia.org/T147223) (owner: 10Gehel) [15:05:53] !log Chris moved cr1-eqiad:xe-5/0/3 to xe-3/3/2 [15:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:01] !log Reactivate cr1-eqiad BGP peering with pfw1-eqiad [15:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:30] !log Enabling gtid_domain_id db1020 (m2 master) - T149418 [15:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:35] T149418: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418 [15:08:27] Urbanecm: arseny92 Reedy took me a while but I got shortUrl added ;} [15:08:38] !log Deactivate cr2-eqiad BGP peering with pfw1-eqiad [15:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:46] Good! All is completed at this time? [15:08:52] hashar: the patch I made to wikimediamaintenance will make it easier for future :) [15:09:20] maybe we should just enable shortUrl everywhere ? [15:09:22] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: shortUrl for bdwikimedia and tcywiki T146014 and T150166 (duration: 01m 51s) [15:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:28] T146014: Enable Extension:ShortUrl on chapterwiki of WMBD - https://phabricator.wikimedia.org/T146014 [15:09:28] T150166: Create short URL for Tulu (tcy) Wikipedia - https://phabricator.wikimedia.org/T150166 [15:09:35] Reedy , maybe patch it to support every extension that we run? [15:09:41] arseny92: Why? 
[15:10:01] Most extensions that we enable on demand are in there [15:10:10] The rest should be added via addWiki at creation [15:10:14] !log European SWAT completed [15:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:42] Reedy: because of those make it easier for future cases [15:11:41] I think most of them are there [15:13:19] !log Chris moved cr2-eqiad:xe-5/0/3 to xe-3/3/2 [15:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:05] !log Reactivate cr2-eqiad BGP peering with pfw1-eqiad [15:14:10] (CR) Giuseppe Lavagetto: [C: 2] hhvm::admin: Restrict to domain networks [puppet] - https://gerrit.wikimedia.org/r/304476 (owner: Muehlenhoff) [15:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:30] (CR) Giuseppe Lavagetto: [C: 2] docker::registry: add monitoring [puppet] - https://gerrit.wikimedia.org/r/320206 (owner: Giuseppe Lavagetto) [15:16:49] (CR) Alexandros Kosiaris: Maps - tilerator on all maps servers needs access to postgresql master (1 comment) [puppet] - https://gerrit.wikimedia.org/r/319893 (https://phabricator.wikimedia.org/T147223) (owner: Gehel) [15:18:01] (PS1) Marostegui: Repool db2042 - the maintenance is post poned as db2034 has hardware issues and cannot even receive all the data (T149553#2776069) [mediawiki-config] - https://gerrit.wikimedia.org/r/320207 [15:18:16] (PS2) Marostegui: Repool db2042 - the maintenance is post poned as db2034 has hardware issues and cannot even receive all the data (T149553#2776069) [mediawiki-config] - https://gerrit.wikimedia.org/r/320207 [15:18:18] Operations, Patch-For-Review: Not all packages from packages::statistics are available on jessie - https://phabricator.wikimedia.org/T150003#2776465 (Halfak) The myspell packages are still needed. If some aren't available, we can replace them with the best available aspell or hunspell packages. 
[15:18:26] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [15:18:37] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:18:45] (03CR) 10Jcrespo: "+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320207 (owner: 10Marostegui) [15:19:23] !log Restarting Jenkins (deadlock in beta cluster Jenkins jobs) [15:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:28] 06Operations, 10Monitoring: Huge log files on icinga machines - https://phabricator.wikimedia.org/T150061#2776474 (10Joe) >>! In T150061#2776176, @Volans wrote: > @fgiunchedi @akosiaris @Joe > From the looks of it (I've take just a quick look, correct me if I'm wrong): > - Puppet spam is caused by a missing... [15:20:34] (03CR) 10Marostegui: [C: 032] Repool db2042 - the maintenance is post poned as db2034 has hardware issues and cannot even receive all the data (T149553#2776069) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320207 (owner: 10Marostegui) [15:21:25] (03Merged) 10jenkins-bot: Repool db2042 - the maintenance is post poned as db2034 has hardware issues and cannot even receive all the data (T149553#2776069) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320207 (owner: 10Marostegui) [15:21:26] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [15:22:47] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2042 - T149553 (duration: 00m 49s) [15:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:53] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553 [15:23:49] 06Operations, 10Monitoring: Huge log files on icinga machines - https://phabricator.wikimedia.org/T150061#2776482 (10akosiaris) >>! 
In T150061#2776176, @Volans wrote: > @fgiunchedi @akosiaris @Joe > From the looks of it (I've take just a quick look, correct me if I'm wrong): > - Puppet spam is caused by a m... [15:23:59] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:26:34] !log rebooting kafka1013 for kernel upgrades [15:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:19] (03PS1) 10ArielGlenn: no bare exceptions; all exceptions use "except blah as ex" [dumps] - 10https://gerrit.wikimedia.org/r/320208 [15:29:09] (03PS3) 10Gehel: Maps - tilerator on all maps servers needs access to postgresql master [puppet] - 10https://gerrit.wikimedia.org/r/319893 (https://phabricator.wikimedia.org/T147223) [15:30:55] (03CR) 10Gehel: Maps - tilerator on all maps servers needs access to postgresql master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/319893 (https://phabricator.wikimedia.org/T147223) (owner: 10Gehel) [15:31:03] !log Disabling OSPF/OSPF3 on cr2-codfw:xe-5/0/1 for eqiad side port move [15:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:26] (03CR) 10Alexandros Kosiaris: [C: 031] Maps - tilerator on all maps servers needs access to postgresql master [puppet] - 10https://gerrit.wikimedia.org/r/319893 (https://phabricator.wikimedia.org/T147223) (owner: 10Gehel) [15:33:21] (03PS1) 10ArielGlenn: maxint -> maxsize (python 3 fix) [dumps] - 10https://gerrit.wikimedia.org/r/320209 [15:38:20] !log Reenabling OSPF/OSPF3 on cr2-codfw:xe-5/0/1 after eqiad side port move to xe-3/2/3 [15:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:36] (03CR) 10Elukey: "https://grafana.wikimedia.org/dashboard/db/kafka?panelId=34 shows some good examples about how nf_conntrack varies over time. 
Usually it " [puppet] - 10https://gerrit.wikimedia.org/r/319071 (owner: 10Muehlenhoff) [15:39:52] !log rebooting radium for kernel update [15:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:40] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [15:43:11] 06Operations, 10Monitoring: Huge log files on icinga machines - https://phabricator.wikimedia.org/T150061#2776531 (10Joe) Ok, after a few puppet runs it seems clear to me that ordering is not constant anymore when coming from puppetdb. I will try to set the order in the puppetdb query first. [15:43:14] ah snap [15:43:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:44:00] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:44:19] !log started kafka-mirror-main-eqiad_to_analytics.service on kafka1012 [15:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:44:40] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [15:46:21] (03PS1) 10ArielGlenn: use sorted() everywhere instead of sort() (ptyhon 3 change) [dumps] - 10https://gerrit.wikimedia.org/r/320211 [15:47:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:50:00] 06Operations, 10ops-eqiad, 10hardware-requests: Return wmf4747/wmf4748/wmf4749/wmf4750 to spares - 
https://phabricator.wikimedia.org/T146171#2776552 (10RobH) a:03RobH [15:50:58] (03PS2) 10Mark Bergsma: Reflect new FPC3 ports after cr1-/cr2-eqiad FPC5 decommissioning [dns] - 10https://gerrit.wikimedia.org/r/319617 [15:52:00] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:52:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:54:35] 06Operations, 10netops: HTCP purges flood across CODFW - https://phabricator.wikimedia.org/T133387#2776585 (10akosiaris) FTR, this still holds true today. There isn't really any reason it should have been fixed, just noting it. [15:54:44] !log Disabling cr2-eqiad BGP groups IX4/IX6 (all Equinix Ashburn BGP sessions) [15:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:10] (03Abandoned) 10ArielGlenn: Fixed PEP-8 issues [dumps] - 10https://gerrit.wikimedia.org/r/207504 (owner: 10Dereckson) [15:57:53] (03PS2) 10ArielGlenn: comment cleanup [dumps] - 10https://gerrit.wikimedia.org/r/207712 (owner: 10Dereckson) [16:00:25] !log Chris moved cr2-eqiad:xe-5/3/3 to xe-3/3/3 [16:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:57] (03PS5) 10Ottomata: eventstreams puppetization [puppet] - 10https://gerrit.wikimedia.org/r/317981 (https://phabricator.wikimedia.org/T148779) [16:01:45] !log Reactivated cr2-eqiad IX6 BGP group (ipv6 sessions) [16:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:10] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:03:27] (03CR) 10jenkins-bot: [V: 04-1] eventstreams puppetization [puppet] - 10https://gerrit.wikimedia.org/r/317981 (https://phabricator.wikimedia.org/T148779) (owner: 10Ottomata) [16:04:21] (03CR) 10ArielGlenn: [C: 032] use sorted() everywhere instead of sort() (ptyhon 3 change) [dumps] - 10https://gerrit.wikimedia.org/r/320211 (owner: 10ArielGlenn) [16:07:22] (03CR) 10ArielGlenn: [C: 032] no bare exceptions; all exceptions use "except blah as ex" [dumps] - 10https://gerrit.wikimedia.org/r/320208 (owner: 10ArielGlenn) [16:07:45] (03CR) 10ArielGlenn: [C: 032] maxint -> maxsize (python 3 fix) [dumps] - 10https://gerrit.wikimedia.org/r/320209 (owner: 10ArielGlenn) [16:08:39] (03CR) 10ArielGlenn: [C: 032] comment cleanup [dumps] - 10https://gerrit.wikimedia.org/r/207712 (owner: 10Dereckson) [16:11:26] (03CR) 10ArielGlenn: "these scripts look a lot different (now that the production scripts are in master branch), I need to see if they are still sh only syntax " [dumps] - 10https://gerrit.wikimedia.org/r/207694 (owner: 10Dereckson) [16:13:16] (03CR) 10Gilles: "I see that firejail has support for rlimits. It might be what's getting in the way and we can try to use that instead." 
[puppet] - 10https://gerrit.wikimedia.org/r/319802 (https://phabricator.wikimedia.org/T145878) (owner: 10Gilles) [16:20:40] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2013_v6, cp2016_v6 [16:20:50] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2013_v6, cp2016_v6 [16:20:50] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2013_v6, cp2016_v6 [16:20:50] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2018_v6 [16:20:50] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2014_v6, cp2017_v6 [16:20:50] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2014_v6, cp2017_v6 [16:20:51] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2014_v6, cp2017_v6 [16:21:00] PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2015_v6 [16:21:00] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2015_v6 [16:21:00] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2013_v6, cp2016_v6 [16:21:00] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2014_v6, cp2017_v6 [16:21:10] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2014_v6, cp2017_v6 [16:21:10] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 142 connecting: cp2013_v6, cp2014_v6, cp2015_v6, cp2016_v6, cp2017_v6, cp2018_v6 [16:21:10] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 35 connecting: (unnamed), cp1048_v6, cp1049_v6, cp1050_v6, cp1062_v6, cp1063_v6, cp1064_v6, cp1071_v6, cp1072_v6, cp1073_v6, cp1074_v6, cp1099_v6,kafka1012_v6,kafka1014_v6,kafka1020_v6,kafka1022_v6 not-conn: cp3034_v6, cp3035_v6, cp3036_v6, cp3037_v6, cp3038_v6, cp3039_v6, cp3044_v6, cp3045_v6, cp3046_v6, cp3047_v6, cp3048_v6, cp3049_v6, cp4005_v6, 
cp4006_v6, cp40 [16:21:10] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 142 connecting: cp2013_v6, cp2014_v6, cp2015_v6, cp2016_v6, cp2017_v6, cp2018_v6 [16:21:10] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2018_v6 [16:21:11] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2015_v6 [16:21:11] PROBLEM - IPsec on cp4002 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2018_v6 [16:21:12] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2015_v6 [16:21:12] PROBLEM - IPsec on cp4001 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2018_v6 [16:21:13] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2015_v6 [16:21:13] PROBLEM - IPsec on cp4004 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2018_v6 [16:21:14] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2015_v6 [16:21:14] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2014_v6, cp2017_v6 [16:21:20] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2014_v6, cp2017_v6 [16:21:20] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 35 connecting: (unnamed), cp1048_v6, cp1049_v6, cp1050_v6, cp1062_v6, cp1063_v6, cp1064_v6, cp1071_v6, cp1072_v6, cp1073_v6, cp1074_v6, cp1099_v6, cp3034_v6, cp3035_v6, cp3038_v6, cp3039_v6, cp3045_v6, cp3047_v6, cp4005_v6, cp4006_v6, cp4013_v6, cp4014_v6, cp4015_v6,kafka1012_v6,kafka1014_v6,kafka1020_v6,kafka1022_v6 not-conn: cp3036_v6, cp3037_v6, cp3044_v6, cp30 [16:21:20] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2014_v6, cp2017_v6 [16:21:20] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 142 connecting: cp2013_v6, cp2014_v6, cp2015_v6, cp2016_v6, cp2017_v6, cp2018_v6 [16:21:20] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 52 
connecting: cp2014_v6, cp2017_v6 [16:21:21] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2014_v6, cp2017_v6 [16:21:21] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2014_v6, cp2017_v6 [16:21:30] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2014_v6, cp2017_v6 [16:21:30] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2018_v6 [16:21:30] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2015_v6 [16:21:30] PROBLEM - IPsec on cp4003 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2018_v6 [16:21:30] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2014_v6, cp2017_v6 [16:21:31] PROBLEM - IPsec on cp3006 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2015_v6 [16:21:31] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2015_v6 [16:21:32] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2014_v6, cp2017_v6 [16:21:32] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2014_v6, cp2017_v6 [16:21:33] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2014_v6, cp2017_v6 [16:21:33] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 18 connecting: (unnamed), cp1046_v6, cp1047_v6, cp1059_v6, cp1060_v6, cp3003_v6, cp3004_v6, cp4011_v6, cp4012_v6,kafka1012_v6,kafka1014_v6,kafka1020_v6,kafka1022_v6 not-conn: cp3005_v6, cp3006_v6, cp4019_v6, cp4020_v6,kafka1013_v6,kafka1018_v6 [16:21:34] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2014_v6, cp2017_v6 [16:21:34] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2014_v6, cp2017_v6 [16:21:35] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2015_v6 [16:21:50] PROBLEM - IPsec on cp1046 is 
CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2015_v6 [16:21:51] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2014_v6, cp2017_v6 [16:21:51] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2014_v6, cp2017_v6 [16:21:51] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 142 connecting: cp2013_v6, cp2014_v6, cp2015_v6, cp2016_v6, cp2017_v6, cp2018_v6 [16:21:51] PROBLEM - IPsec on cp3009 is CRITICAL: Strongswan CRITICAL - ok: 27 connecting: cp2018_v6 [16:22:02] PROBLEM - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2018_v6 [16:22:02] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2014_v6, cp2017_v6 [16:22:02] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2014_v6, cp2017_v6 [16:22:02] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2013_v6, cp2016_v6 [16:22:14] (03PS1) 10Gilles: Define Thumbor file size rlimit in firejail, not systemd [puppet] - 10https://gerrit.wikimedia.org/r/320216 (https://phabricator.wikimedia.org/T145878) [16:23:20] (03CR) 10Gilles: "Verified on beta: gilles@deployment-imagescaler01:~$ systemctl status thumbor@8801" [puppet] - 10https://gerrit.wikimedia.org/r/320216 (https://phabricator.wikimedia.org/T145878) (owner: 10Gilles) [16:24:30] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [16:24:30] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK [16:24:30] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 24 ESP OK [16:24:30] RECOVERY - IPsec on cp4003 is OK: Strongswan OK - 28 ESP OK [16:24:30] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [16:24:31] RECOVERY - IPsec on cp3006 is OK: Strongswan OK - 28 ESP OK [16:24:31] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK [16:24:32] RECOVERY - IPsec on cp3005 is OK: Strongswan OK - 28 ESP OK [16:24:32] RECOVERY - IPsec on cp3048 is OK: Strongswan 
OK - 54 ESP OK [16:24:33] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK [16:24:33] RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 36 ESP OK [16:24:34] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 54 ESP OK [16:24:34] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 54 ESP OK [16:24:35] RECOVERY - IPsec on cp3004 is OK: Strongswan OK - 28 ESP OK [16:24:50] RECOVERY - IPsec on cp1046 is OK: Strongswan OK - 24 ESP OK [16:24:50] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 44 ESP OK [16:24:50] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 44 ESP OK [16:24:50] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [16:24:50] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 28 ESP OK [16:24:51] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK [16:24:51] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [16:24:52] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 148 ESP OK [16:24:52] RECOVERY - IPsec on cp3009 is OK: Strongswan OK - 28 ESP OK [16:24:53] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [16:24:53] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [16:25:00] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK [16:25:00] RECOVERY - IPsec on cp1045 is OK: Strongswan OK - 24 ESP OK [16:25:00] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK [16:25:00] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 44 ESP OK [16:25:00] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 54 ESP OK [16:25:01] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [16:25:01] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [16:25:02] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 44 ESP OK [16:25:10] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [16:25:10] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 148 ESP OK [16:25:10] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 70 ESP OK [16:25:10] RECOVERY - IPsec on kafka1020 is OK: Strongswan 
OK - 148 ESP OK [16:25:10] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 24 ESP OK [16:25:11] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 28 ESP OK [16:25:11] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 28 ESP OK [16:25:12] RECOVERY - IPsec on cp4004 is OK: Strongswan OK - 28 ESP OK [16:25:12] RECOVERY - IPsec on cp4001 is OK: Strongswan OK - 28 ESP OK [16:25:13] RECOVERY - IPsec on cp4002 is OK: Strongswan OK - 28 ESP OK [16:25:13] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 28 ESP OK [16:25:14] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 28 ESP OK [16:25:14] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK [16:25:20] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [16:25:20] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 70 ESP OK [16:25:20] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [16:25:20] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 148 ESP OK [16:25:20] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 54 ESP OK [16:25:21] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 54 ESP OK [16:25:21] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [16:28:49] (03CR) 10ArielGlenn: "Indeed these scripts now have bashisms in them. I am open to patchsets that would keep the functionality and make them sh compatible." [dumps] - 10https://gerrit.wikimedia.org/r/207694 (owner: 10Dereckson) [16:29:17] 06Operations, 10netops: Migrate links from cr1-eqiad/cr2-eqiad fpc 5 to fpc 3 - https://phabricator.wikimedia.org/T149196#2776789 (10mark) a:03mark All ports have been moved off of FPC5 on both routers, and all configuration for FPC5 ports has been removed. The Equinix Ashburn IXP port is still awaiting the... 
[16:30:10] RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:30:21] (03PS3) 10Mark Bergsma: Reflect new FPC3 ports after cr1-/cr2-eqiad FPC5 decommissioning [dns] - 10https://gerrit.wikimedia.org/r/319617 (https://phabricator.wikimedia.org/T149196) [16:34:08] (03PS1) 10EBernhardson: Setup CirrusSearch interwiki load test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320220 (https://phabricator.wikimedia.org/T149740) [16:34:30] by the way, i guess T131385 should have been deployed months ago hashar ? [16:34:30] T131385: Dynamically fiddle with wgLocalDatabases to recognise wikitech separation - https://phabricator.wikimedia.org/T131385 [16:35:23] arseny92: I have no idea [16:36:01] as you triaged normal and prod impact [16:37:35] the patch looks good i guess [16:42:33] arseny92: I just quickly triaged that task on the wikimedia-log-errors board. I am not involved in having it fixed [16:51:23] 06Operations, 06Discovery, 06Maps: Investigate how Kartotherian metrics are published and what they mean - https://phabricator.wikimedia.org/T149889#2776874 (10Gehel) A quick look at graphite1001 indicates that we already publish ~ 64k metrics for Kartotherian: ``` gehel@graphite1001:~$ find /var/lib/carbon/... [16:55:10] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [16:56:10] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3077470 keys, up 7 days 8 hours - replication_delay is 0 [16:57:30] moritzm: am curious, and maybe you know, will kernel live patching come to debian sometime soon (does it exist already?) [16:58:22] (03PS1) 10Muehlenhoff: statistics::packages: Remove zpubsub [puppet] - 10https://gerrit.wikimedia.org/r/320227 (https://phabricator.wikimedia.org/T150003) [16:59:05] ottomata: no, not anytime soon. 
the generic in-kernel support isn't complete yet [16:59:20] oh hm, thought i saw something saying it was. aye cool [17:00:00] Ubuntu offers some live patching service since a few weeks, but it's useless [17:08:53] 06Operations, 13Patch-For-Review: Not all packages from packages::statistics are available on jessie - https://phabricator.wikimedia.org/T150003#2776958 (10MoritzMuehlenhoff) >>! In T150003#2776465, @Halfak wrote: > The myspell packages are still needed. If some aren't available, we can replace them with the... [17:15:08] (03CR) 10Faidon Liambotis: [C: 04-1] Reflect new FPC3 ports after cr1-/cr2-eqiad FPC5 decommissioning (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/319617 (https://phabricator.wikimedia.org/T149196) (owner: 10Mark Bergsma) [17:16:01] 06Operations, 10ops-eqiad, 10DBA, 06Labs, and 2 others: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2776977 (10mark) Approved. [17:17:38] (03CR) 10Hashar: [C: 031] "We can drop the 'zuul.eqiad.wmnet' DNS entry now. The last user was Nodepool and it is now pointing directly to contint1001.wikimedia.org." [dns] - 10https://gerrit.wikimedia.org/r/319675 (owner: 10Hashar) [17:22:44] 06Operations, 13Patch-For-Review: Not all packages from packages::statistics are available on jessie - https://phabricator.wikimedia.org/T150003#2777038 (10Ottomata) Or, for now, we could include these from somewhere other than the class that will be included on the notebook hosts. Or conditionally only inclu... [17:23:33] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2777040 (10Gilles) >>! In T66214#2772934, @Tgr wrote: > Projects which can get away with being Wikimedia-only (such as the mobile apps) could just use... 
[17:24:25] 06Operations, 10ops-eqiad, 10DBA, 06Labs, and 2 others: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2777041 (10jcrespo) a:05jcrespo>03None @RobH You mentioned it may not need a physical movement... [17:28:05] 06Operations, 10ops-eqiad, 10DBA, 06Labs, and 2 others: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2777052 (10RobH) I said that in reply to the labs to db server transition, not the transition of t... [17:29:51] 06Operations: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#2777054 (10AlexMonk-WMF) >>! In T117095#2775967, @Kelson wrote: > We can not really do monthly snapshot (what would a good thing) and we have pretty serious difficulties to create ZIM... [17:30:11] 06Operations, 10ops-eqiad, 10DBA, 06Labs, and 2 others: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2777055 (10jcrespo) Sorry for the misunderstanding! [17:36:43] 06Operations, 13Patch-For-Review: Not all packages from packages::statistics are available on jessie - https://phabricator.wikimedia.org/T150003#2777063 (10MoritzMuehlenhoff) I've rebuilt all the dict packages (seven were imported from trusty and two from xenial since they failed to build on jessie), uploads... [17:37:25] godog: can you please either enable puppet and updates on filippo-test-trusty or delete the instance? 
It's been needing a lot of special attention lately :( [17:37:52] jouncebot: now [17:37:52] No deployments scheduled for the next 0 hour(s) and 22 minute(s) [17:37:56] jouncebot: next [17:37:57] In 0 hour(s) and 22 minute(s): Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161107T1800) [17:38:29] I've got a backport that will help with ELK logging issues that we are having. [17:38:49] greg-g: can I sneak this out? -- https://gerrit.wikimedia.org/r/#/c/320232/ [17:39:01] andrewbogott: yeah delete it if it causing problems [17:39:10] godog: ok, thanks [17:39:23] bd808: sneak away [17:40:36] !log performing schema change on s7 (imagelinks) T139090 [17:40:40] 06Operations, 06Performance-Team, 10Thumbor: Make Thumbor IM engine based on a subprocess - https://phabricator.wikimedia.org/T149903#2777087 (10Gilles) Might not be necessary after all if this works out: https://gerrit.wikimedia.org/r/#/c/319807/ [17:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:42] T139090: Deploy I2b042685 to all databases - https://phabricator.wikimedia.org/T139090 [17:42:29] 06Operations, 13Patch-For-Review: Not all packages from packages::statistics are available on jessie - https://phabricator.wikimedia.org/T150003#2777093 (10MoritzMuehlenhoff) The packages were imported from trusty and built for jessie-wikimedia: dict-nr_20070206-4ubuntu1+wmf1_amd64.changes dict-ns_20070206-4ub... [17:44:10] * bd808 twiddles thumbs while jenkins does its thing [17:44:18] (03CR) 10Ottomata: [C: 031] statistics::packages: Remove zpubsub [puppet] - 10https://gerrit.wikimedia.org/r/320227 (https://phabricator.wikimedia.org/T150003) (owner: 10Muehlenhoff) [17:46:02] brion: Ping? 
[17:46:54] (03CR) 10Muehlenhoff: [C: 032] statistics::packages: Remove zpubsub [puppet] - 10https://gerrit.wikimedia.org/r/320227 (https://phabricator.wikimedia.org/T150003) (owner: 10Muehlenhoff) [17:55:58] 06Operations, 13Patch-For-Review: Not all packages from packages::statistics are available on jessie - https://phabricator.wikimedia.org/T150003#2777136 (10MoritzMuehlenhoff) So this is down to one package now: python-pygeoip seems to be an internal package. Is that still needed? Debian has python-geoip, maybe... [17:56:26] !log bd808@tin Synchronized php-1.29.0-wmf.1/includes/exception/MWExceptionHandler.php: MWExceptionHandler: Do not use 'exception' for custom log data (T150106) (duration: 00m 47s) [17:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:32] T150106: MediaWiki logs on the EventBus channel causing indexing failures in ELK Elasticsearch - https://phabricator.wikimedia.org/T150106 [17:57:55] 06Operations: update-ca-certificates, run via puppets sslcert module, doesn't update symlinks to replaced certificates - https://phabricator.wikimedia.org/T150058#2777154 (10AlexMonk-WMF) >>! In T150058#2775860, @akosiaris wrote: > @AlexMonk-WMF As @Joe said, we 've copied over the CA in production (2 months ago... [17:58:41] 06Operations: update-ca-certificates, run via puppets sslcert module, doesn't update symlinks to replaced certificates - https://phabricator.wikimedia.org/T150058#2777157 (10akosiaris) >>! In T150058#2777154, @AlexMonk-WMF wrote: >>>! In T150058#2775860, @akosiaris wrote: >> @AlexMonk-WMF As @Joe said, we 've co... [18:00:04] gehel: Dear anthropoid, the time has come. Please deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161107T1800). [18:00:06] 06Operations, 13Patch-For-Review: Not all packages from packages::statistics are available on jessie - https://phabricator.wikimedia.org/T150003#2777159 (10Halfak) >>! 
In T150003#2776958, @MoritzMuehlenhoff wrote: > None of these languages have an aspell or hunspell package in Debian. So, if these are in fact... [18:00:34] !log T133395: Convert local_group_*_title__revisions.{data,idx_by_rev_ever} tables to time-window compaction [18:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:40] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [18:03:16] SMalyshev: latest wdqs deployed on beta, seems not working, I'm having a look [18:03:43] 06Operations, 13Patch-For-Review: Not all packages from packages::statistics are available on jessie - https://phabricator.wikimedia.org/T150003#2777166 (10MoritzMuehlenhoff) https://github.com/wikimedia/operations-puppet/blob/production/modules/ores/manifests/base.pp#L20 is only a subset of dictionaries, the... [18:03:49] (03PS4) 10Elukey: First Docker prototype [software/varnish/varnishkafka/testing] - 10https://gerrit.wikimedia.org/r/319548 (https://phabricator.wikimedia.org/T147442) [18:06:46] (03PS2) 10EBernhardson: Setup CirrusSearch interwiki load test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320220 (https://phabricator.wikimedia.org/T149740) [18:07:31] (03CR) 10jenkins-bot: [V: 04-1] Setup CirrusSearch interwiki load test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320220 (https://phabricator.wikimedia.org/T149740) (owner: 10EBernhardson) [18:07:34] (03PS3) 10EBernhardson: Setup CirrusSearch interwiki load test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320220 (https://phabricator.wikimedia.org/T149740) [18:08:03] (03CR) 10DCausse: Setup CirrusSearch interwiki load test (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320220 (https://phabricator.wikimedia.org/T149740) (owner: 10EBernhardson) [18:09:30] (03PS3) 10Dzahn: confluent::kafka::mirror::jmxtrans: key attr is declared more than once [puppet] - 10https://gerrit.wikimedia.org/r/319770 [18:09:42] (03CR) 
10Dzahn: [C: 032] confluent::kafka::mirror::jmxtrans: key attr is declared more than once [puppet] - 10https://gerrit.wikimedia.org/r/319770 (owner: 10Dzahn) [18:09:49] grrrit-wm: restart [18:09:51] re-connecting to gerrit [18:09:52] reconnected to gerrit [18:12:53] (03PS4) 10EBernhardson: Setup CirrusSearch interwiki load test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320220 (https://phabricator.wikimedia.org/T149740) [18:14:22] (03PS5) 10Elukey: First Docker prototype [software/varnish/varnishkafka/testing] - 10https://gerrit.wikimedia.org/r/319548 (https://phabricator.wikimedia.org/T147442) [18:15:16] (03CR) 10EBernhardson: Setup CirrusSearch interwiki load test (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320220 (https://phabricator.wikimedia.org/T149740) (owner: 10EBernhardson) [18:15:54] (03PS2) 10Dzahn: Add projectcom.wikimedia.org to Apache [puppet] - 10https://gerrit.wikimedia.org/r/319123 (https://phabricator.wikimedia.org/T143138) (owner: 10Dereckson) [18:20:46] (03CR) 10Paladox: [C: 031] Add projectcom.wikimedia.org to Apache [puppet] - 10https://gerrit.wikimedia.org/r/319123 (https://phabricator.wikimedia.org/T143138) (owner: 10Dereckson) [18:28:38] 06Operations, 10Citoid, 10VisualEditor, 06Services (blocked): Separate citoid service for beta that runs off master instead of deploy - https://phabricator.wikimedia.org/T92304#2777252 (10Mvolz) >>! In T92304#2767678, @mobrovac wrote: > Nope, @Mvolz that's not it. I will likely delete that instance as it's... [18:28:48] 06Operations, 10Citoid, 10VisualEditor, 06Services (blocked): Separate citoid service for beta that runs off master instead of deploy - https://phabricator.wikimedia.org/T92304#2777253 (10Mvolz) >>! In T92304#2767678, @mobrovac wrote: > Nope, @Mvolz that's not it. I will likely delete that instance as it's... 
[18:36:58] 06Operations, 06Discovery, 10Elasticsearch, 10hardware-requests, 06Discovery-Search (Current work): elasticsearch new servers (5x eqiad / 12x codfw) - https://phabricator.wikimedia.org/T149089#2777285 (10Gehel) a:03RobH [18:41:01] 06Operations, 06Discovery, 06Maps, 06WMF-Legal, 03Interactive-Sprint: Define tile usage policy - https://phabricator.wikimedia.org/T141815#2777320 (10grin) >>! In T141815#2773589, @Gehel wrote: > @debt: as @BBlack pointed out in the start of this thread, we tend to have a fairly liberal view on who can r... [18:51:29] 07Puppet, 06Labs: Puppet parser, puppet API, and inline docs - https://phabricator.wikimedia.org/T148479#2777347 (10Andrew) [18:55:39] 06Operations, 06Discovery, 06Maps, 06WMF-Legal, 03Interactive-Sprint: Define tile usage policy - https://phabricator.wikimedia.org/T141815#2513156 (10Deskana) I think this task is conflating two separate issues. - Define //technical// tile usage policy (i.e. what users can do without melting the servers... [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161107T1900). Please do the needful. [19:06:05] (03CR) 10DCausse: [C: 031] Setup CirrusSearch interwiki load test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320220 (https://phabricator.wikimedia.org/T149740) (owner: 10EBernhardson) [19:08:13] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2777401 (10Andrew) @_joe_, can you please amend (or provide a new version) of the RFC incorporating the agreements here? I've re-read this thread but a... 
[19:11:30] (03PS1) 10Dzahn: base: also install freeipmi on trusty hosts [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) [19:12:43] (03PS2) 10Dzahn: base: also install freeipmi on trusty hosts [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) [19:13:58] 06Operations, 13Patch-For-Review: Not all packages from packages::statistics are available on jessie - https://phabricator.wikimedia.org/T150003#2777460 (10Ottomata) Hm, I don't really know why we need python-pygeoip. It seems like python-geoip should be sufficient, but likely the API is different and pygeoip... [19:17:54] 06Operations, 06Discovery, 06Maps, 06WMF-Legal, 03Interactive-Sprint: Define tile usage policy - https://phabricator.wikimedia.org/T141815#2777484 (10Gehel) @Deskana as always, you have the word of wisdom! In any case, this is a discussion that will take some time, and that will need to be rolled out pr... [19:21:20] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2777511 (10DarTar) @Ottomata @Nuria: anything else you need from us to get this request processed? Please let us know. [19:21:22] ottomata: re: jmx_exporter and kafka mirror maker, the easiest I think would be to run it as a 'java agent', e.g. 
https://github.com/prometheus/jmx_exporter#building-and-running [19:22:57] yeah saw that, we'd have to either patch the confluent deb package, since it provides a CLI wrapper for launching the mirror maker JVM...or just bypass that and run our own java command to launch the process in our systemd unit [19:23:15] java agent sounds nicer, since then the service runs with the jvm [19:23:27] fewer services to manage/monitor [19:24:20] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2777525 (10Nuria) @DarTar: nothing else is needed, ops will get to it as they have the bandwidth [19:25:57] ottomata: yeah, should be easy to test too, curl the port passed on the command line and that will return the metrics [19:26:32] 06Operations, 10puppet-compiler: puppet compiler claims "no change" when catalogs are actually different - https://phabricator.wikimedia.org/T149432#2777560 (10Volans) I can confirm the bug, example here: https://puppet-compiler.wmflabs.org/4560/analytics1001.eqiad.wmnet/ Partial diff: ``` 25284a25317,25332 >... [19:26:49] godog: and, if we get this data into prometheus, we will still be able to work with it in grafana, yes? 
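A rough Python equivalent of the "curl the port" check godog describes at 19:25:57, as a sketch only. It assumes the agent answers plain HTTP on a `/metrics` path; the host, port, and sample metric names are invented for illustration, and the parser handles only simple `name{labels} value` lines rather than the full exposition format.

```python
import urllib.request


def parse_metrics(text):
    """Parse simple Prometheus text-format lines into {series: float}.

    Comment lines (# HELP / # TYPE) and blank lines are skipped. This is a
    simplification: timestamps and escaped label values are not handled.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is taken to be the last space-separated token.
        series, _, value = line.rpartition(" ")
        if not series:
            continue
        try:
            samples[series] = float(value)
        except ValueError:
            continue
    return samples


def fetch_metrics(host, port, timeout=5.0):
    """Fetch and parse metrics, roughly `curl http://host:port/metrics`."""
    url = "http://%s:%d/metrics" % (host, port)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Hypothetical sample; real metric names depend on the exporter config.
SAMPLE = """\
# HELP jvm_threads_current Current thread count
# TYPE jvm_threads_current gauge
jvm_threads_current 42.0
example_messages_total{topic="test"} 1234.0
"""

if __name__ == "__main__":
    for name, value in sorted(parse_metrics(SAMPLE).items()):
        print(name, value)
```

An empty result from `fetch_metrics` would suggest the agent is up but exporting nothing, which is a different failure from a refused connection.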
[19:26:49] ottomata: and the next step is to tell prometheus about the hosts that run mirrormaker and on what port [19:26:56] aye [19:27:23] ottomata: yeah grafana has both graphite and prometheus in each datacenter as sources [19:27:37] (03PS4) 10Gehel: Maps - tilerator on all maps servers needs access to postgresql master [puppet] - 10https://gerrit.wikimedia.org/r/319893 (https://phabricator.wikimedia.org/T147223) [19:28:24] (03CR) 10Volans: [C: 031] "Change LGTM and if we have freeipmi on jessie's hosts I don't see why not on trusty too, so the monitoring effort could be applied to both" [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [19:29:14] (03CR) 10jenkins-bot: [V: 04-1] Maps - tilerator on all maps servers needs access to postgresql master [puppet] - 10https://gerrit.wikimedia.org/r/319893 (https://phabricator.wikimedia.org/T147223) (owner: 10Gehel) [19:29:30] RECOVERY - Host lvs1008 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [19:29:40] RECOVERY - Host lvs1009 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [19:29:40] RECOVERY - configured eth on lvs1008 is OK: OK - interfaces up [19:29:50] (03PS5) 10Gehel: Maps - tilerator on all maps servers needs access to postgresql master [puppet] - 10https://gerrit.wikimedia.org/r/319893 (https://phabricator.wikimedia.org/T147223) [19:29:50] RECOVERY - configured eth on lvs1009 is OK: OK - interfaces up [19:33:10] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [19:33:30] PROBLEM - Host lvs1011 is DOWN: PING CRITICAL - Packet loss = 100% [19:34:00] PROBLEM - Host lvs1010 is DOWN: PING CRITICAL - Packet loss = 100% [19:34:10] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3078048 keys, up 7 days 11 hours - replication_delay is 0 [19:34:10] PROBLEM - Host lvs1012 is DOWN: PING CRITICAL - Packet loss = 
100% [19:34:20] RECOVERY - Host lvs1011 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [19:34:30] RECOVERY - Host lvs1012 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:34:40] PROBLEM - Host lvs1009 is DOWN: PING CRITICAL - Packet loss = 100% [19:34:40] PROBLEM - Host lvs1008 is DOWN: PING CRITICAL - Packet loss = 100% [19:38:30] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [19:40:27] 06Operations, 10Ops-Access-Requests: Requesting access to fluorine for matanya - https://phabricator.wikimedia.org/T149832#2777662 (10Matanya) [19:42:41] 06Operations, 10Ops-Access-Requests: Requesting access to fluorine for matanya - https://phabricator.wikimedia.org/T149832#2777681 (10Matanya) Public key: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDaREf06EZa5CoRLc7sUNqZqDJVIKIxaxpsMsiG98QixtEEGQLw6W0CCwRwls+NVunzERumXJsKaAPBX5ptn5wH4ZXMfrD1y9CWqQFYkImKI4BJjCCef7x... [19:44:30] RECOVERY - Host lvs1008 is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [19:45:08] robh: can you please verify i filled this^^ request correctly ? [19:45:44] is the lvs bouncing expected? [19:46:28] (03CR) 10Dzahn: [C: 032] "tested on mw1017" [puppet] - 10https://gerrit.wikimedia.org/r/319123 (https://phabricator.wikimedia.org/T143138) (owner: 10Dereckson) [19:46:30] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:47:52] matanya: checking [19:47:59] thanks [19:48:30] 06Operations, 10Wikimedia-General-or-Unknown, 10hardware-requests: Extend capacity for video scalers - https://phabricator.wikimedia.org/T150067#2777690 (10Reedy) [19:49:44] matanya: these are mediawiki logs you want right? [19:49:52] so group mw-log-readers: [19:49:52] yes robh [19:50:08] and your wikitech username is same as your wanted username and irc name right? 
[19:50:10] PROBLEM - check_redis on payments1002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [19:50:13] we base your uid off that [19:50:52] 06Operations, 10Wikimedia-General-or-Unknown, 10hardware-requests: Extend capacity for video scalers - https://phabricator.wikimedia.org/T150067#2773188 (10Matanya) probably adding a GPU might be wise as well. [19:51:02] yes robh [19:51:22] yeah i'll get this working for you shortly [19:51:26] 06Operations, 06Performance-Team, 10Thumbor: Investigate why oom_kill mtail program doesn't work properly - https://phabricator.wikimedia.org/T149980#2777696 (10fgiunchedi) Indeed that's odd, looks like mtail stops pushing metrics to graphite. Now lithium is running with a mtail version from upstream plus my... [19:51:29] its past the three days =] [19:51:30] thanks! [19:51:40] 06Operations, 10Continuous-Integration-Infrastructure: On Trusty and Jessie PHP yields: PHP Deprecated: Comments starting with '#' are deprecated in /etc/php5/cli/conf.d/20-xhprof.ini on line 2 - https://phabricator.wikimedia.org/T135338#2777697 (10hashar) 05declined>03Open [19:52:15] (03PS3) 10Dereckson: add projectcom.wikimedia.org for new private wiki [dns] - 10https://gerrit.wikimedia.org/r/305120 (https://phabricator.wikimedia.org/T143138) (owner: 10Dzahn) [19:52:54] 06Operations, 10Continuous-Integration-Infrastructure: On Trusty and Jessie PHP yields: PHP Deprecated: Comments starting with '#' are deprecated in /etc/php5/cli/conf.d/20-xhprof.ini on line 2 - https://phabricator.wikimedia.org/T135338#2295773 (10hashar) I have reopened the task, the notice is quite annoyin... [19:53:27] Hi mutante. Could you also take care of the DNS change please? If so, I can setup it as the same time than ec.wikimedia.org this evening. 
[19:53:30] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:55:07] (03PS1) 10RobH: granting matanya shell access [puppet] - 10https://gerrit.wikimedia.org/r/320256 (https://phabricator.wikimedia.org/T149832) [19:55:10] PROBLEM - check_redis on payments1002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [19:55:56] (03CR) 10RobH: [C: 032] granting matanya shell access [puppet] - 10https://gerrit.wikimedia.org/r/320256 (https://phabricator.wikimedia.org/T149832) (owner: 10RobH) [19:56:24] grrr, check_redis on payments* is not a real problem, it's just taking forever for icinga to pick up config changes [19:58:38] Dereckson: i am aware of that and was waiting for half an hour until puppet ran across the fleet. last time we added to DNS first, clicked the URL, and then got a cached errror page in varnish [19:58:45] 06Operations, 10Ops-Access-Requests: Requesting access to fluorine for matanya - https://phabricator.wikimedia.org/T149832#2777721 (10RobH) 05Open>03Resolved a:03RobH Access granted and its now merged live on fluorine. Depending on which bastion server you access via, it may take up to 30 minutes for th... [19:58:52] matanya: access is live, what bastion would you acces via? [19:59:01] 3001 [19:59:01] im happy to manually kick puppet to run immediately on it, already did for fluorine [19:59:12] many many thanks [19:59:14] running [19:59:45] Dereckson: what time is your slot? [19:59:59] i saw you are also doing "ec".wm at the same time? cool! [20:00:10] PROBLEM - check_redis on payments1002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [20:00:55] matanya: ok, access is setup on bast3001+fluorine [20:01:00] the other bastions will get them when they auto call [20:01:02] mutante: before the evening SWAT, 21-23:00 UTC [20:01:03] you should be all set [20:01:24] thanks robh ! what is the dns name for fluorine ? wmnet ? 
[20:01:30] eqiad.wmnet [20:01:32] yep [20:03:14] Dereckson: i'll get it done in time, in a couple minutes [20:05:10] PROBLEM - check_redis on payments1002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [20:05:16] (03PS1) 10BBlack: maps VCL: clean up frontend file [puppet] - 10https://gerrit.wikimedia.org/r/320257 [20:05:18] (03PS1) 10BBlack: VCL: add backend_response_early hooks [puppet] - 10https://gerrit.wikimedia.org/r/320258 (https://phabricator.wikimedia.org/T131503) [20:05:20] (03PS1) 10BBlack: Text VCL: fixup beresp.Cookie for Vary before hfp [puppet] - 10https://gerrit.wikimedia.org/r/320259 (https://phabricator.wikimedia.org/T131503) [20:05:22] (03PS1) 10BBlack: Text VCL: avoid creating empty Cookie header [puppet] - 10https://gerrit.wikimedia.org/r/320260 (https://phabricator.wikimedia.org/T131503) [20:05:33] i am in robh, closing the ticket as resolved, again, thank you, this will save me a lot of time (or use a lot of my time, depends on your view;)) [20:05:39] mutante: thanks [20:05:45] (03PS2) 10Dzahn: Remove zuul.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/319675 (owner: 10Hashar) [20:06:02] (03CR) 10Dzahn: [C: 032] Remove zuul.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/319675 (owner: 10Hashar) [20:06:03] oh, you already closed it [20:06:24] (03Abandoned) 10Hashar: zuul: stop managing unix user/group [puppet] - 10https://gerrit.wikimedia.org/r/315902 (owner: 10Hashar) [20:07:09] (03PS1) 10Andrew Bogott: Wikistatus: Delete pages for deleted instances. [puppet] - 10https://gerrit.wikimedia.org/r/320261 (https://phabricator.wikimedia.org/T140298) [20:09:01] jouncebot: refresh [20:09:05] I refreshed my knowledge about deployments. [20:09:10] PROBLEM - Host lvs1008 is DOWN: PING CRITICAL - Packet loss = 100% [20:09:52] the host is up [20:10:07] (03PS2) 10Andrew Bogott: Wikistatus: Delete pages for deleted instances. 
[puppet] - 10https://gerrit.wikimedia.org/r/320261 (https://phabricator.wikimedia.org/T140298) [20:10:48] (03PS2) 10BBlack: Text VCL: avoid creating empty Cookie header [puppet] - 10https://gerrit.wikimedia.org/r/320260 (https://phabricator.wikimedia.org/T131503) [20:10:51] (03PS2) 10BBlack: VCL: add backend_response_early hooks [puppet] - 10https://gerrit.wikimedia.org/r/320258 (https://phabricator.wikimedia.org/T131503) [20:10:53] (03PS2) 10BBlack: Text VCL: fixup beresp.Cookie for Vary before hfp [puppet] - 10https://gerrit.wikimedia.org/r/320259 (https://phabricator.wikimedia.org/T131503) [20:12:12] mutante: yeah one nic on e.g. lvs1008 flapped though, maybe related to some network or physical work cmjohnson1 mark ? [20:12:20] RECOVERY - Host lvs1010 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [20:12:39] (03Abandoned) 10Dereckson: More UNIX agnostic, less GNU/Linux-centric scripts [dumps] - 10https://gerrit.wikimedia.org/r/207694 (owner: 10Dereckson) [20:12:41] (03CR) 10Andrew Bogott: [C: 032] Wikistatus: Delete pages for deleted instances. 
[puppet] - 10https://gerrit.wikimedia.org/r/320261 (https://phabricator.wikimedia.org/T140298) (owner: 10Andrew Bogott) [20:13:18] godog: yep....it's lvs to row D [20:13:33] working on it right now [20:13:49] thanks [20:14:16] cmjohnson1: ack, thanks [20:14:31] !log cmjohnson1 is performing work on LVS in row D, there might be flaps [20:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:07] (03PS3) 10BBlack: Text VCL: avoid creating empty Cookie header [puppet] - 10https://gerrit.wikimedia.org/r/320260 (https://phabricator.wikimedia.org/T131503) [20:15:09] (03PS2) 10BBlack: maps VCL: clean up frontend file [puppet] - 10https://gerrit.wikimedia.org/r/320257 [20:15:11] (03PS3) 10BBlack: VCL: add backend_response_early hooks [puppet] - 10https://gerrit.wikimedia.org/r/320258 (https://phabricator.wikimedia.org/T131503) [20:15:13] (03PS3) 10BBlack: Text VCL: fixup beresp.Cookie for Vary before hfp [puppet] - 10https://gerrit.wikimedia.org/r/320259 (https://phabricator.wikimedia.org/T131503) [20:15:30] jouncebot: next [20:15:30] In 0 hour(s) and 44 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161107T2100) [20:15:30] In 0 hour(s) and 44 minute(s): Add wikis (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161107T2100) [20:15:30] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:15:35] (03CR) 10BBlack: [C: 032 V: 032] maps VCL: clean up frontend file [puppet] - 10https://gerrit.wikimedia.org/r/320257 (owner: 10BBlack) [20:15:56] (03PS2) 10Filippo Giunchedi: Define Thumbor file size rlimit in firejail, not systemd [puppet] - 10https://gerrit.wikimedia.org/r/320216 (https://phabricator.wikimedia.org/T145878) (owner: 10Gilles) [20:16:10] (03CR) 10Dzahn: [C: 032] add projectcom.wikimedia.org for new private wiki [dns] - 10https://gerrit.wikimedia.org/r/305120 (https://phabricator.wikimedia.org/T143138) (owner: 10Dzahn) [20:16:14] (03PS4) 10Dzahn: add projectcom.wikimedia.org for new private wiki [dns] - 10https://gerrit.wikimedia.org/r/305120 (https://phabricator.wikimedia.org/T143138) [20:16:22] (03CR) 10BBlack: [C: 032 V: 032] VCL: add backend_response_early hooks [puppet] - 10https://gerrit.wikimedia.org/r/320258 (https://phabricator.wikimedia.org/T131503) (owner: 10BBlack) [20:17:01] (03CR) 10BBlack: [C: 032 V: 032] Text VCL: fixup beresp.Cookie for Vary before hfp [puppet] - 10https://gerrit.wikimedia.org/r/320259 (https://phabricator.wikimedia.org/T131503) (owner: 10BBlack) [20:17:26] (03CR) 10BBlack: [C: 032 V: 032] Text VCL: avoid creating empty Cookie header [puppet] - 10https://gerrit.wikimedia.org/r/320260 (https://phabricator.wikimedia.org/T131503) (owner: 10BBlack) [20:21:12] !log projectcom.wikimedia.org created in DNS (T143138) [20:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:19] T143138: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138 [20:22:23] (03PS1) 10Reedy: Set $wgOATHAuthAccountPrefix to 'Wikimedia' for WMF CA wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320266 [20:23:39] (03CR) 10Dzahn: "maybe could you -1 this until the dependencies are merged and then change it to +1 when ready to go?" 
[puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [20:24:32] (03CR) 10Dzahn: "adding Ariel because it's (kind of) related to the GC investigation" [puppet] - 10https://gerrit.wikimedia.org/r/317322 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [20:25:31] 06Operations, 10ops-eqiad: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206#2777819 (10fgiunchedi) [20:26:05] (03CR) 10Dzahn: "@paladox: is it intentional that this happens only on jessie?" [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [20:26:27] Dereckson: If you're creating projectcom tonight... don't forget to add OATHAuth tables too [20:26:30] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [20:26:41] Reedy: ok [20:26:46] (03CR) 10Dzahn: "your commit message says "in php.pp" but that is not the case anymore" [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [20:28:23] (03CR) 10Dzahn: [C: 04-1] "has this already been done in another patch? i think so, right? 
would need manual rebase at least" [puppet] - 10https://gerrit.wikimedia.org/r/316622 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [20:28:43] (03CR) 10Filippo Giunchedi: [C: 032] Define Thumbor file size rlimit in firejail, not systemd [puppet] - 10https://gerrit.wikimedia.org/r/320216 (https://phabricator.wikimedia.org/T145878) (owner: 10Gilles) [20:28:48] (03PS3) 10Filippo Giunchedi: Define Thumbor file size rlimit in firejail, not systemd [puppet] - 10https://gerrit.wikimedia.org/r/320216 (https://phabricator.wikimedia.org/T145878) (owner: 10Gilles) [20:32:08] !log repooling cp4018 (done experimenting) [20:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:50] (03PS5) 10Hashar: Add puppet-lint to Rakefile / Gemfile [puppet] - 10https://gerrit.wikimedia.org/r/288620 [20:35:52] (03PS5) 10Hashar: Only run puppet-lint against HEAD by default [puppet] - 10https://gerrit.wikimedia.org/r/288629 [20:37:40] (03CR) 10jenkins-bot: [V: 04-1] Only run puppet-lint against HEAD by default [puppet] - 10https://gerrit.wikimedia.org/r/288629 (owner: 10Hashar) [20:37:45] oh my god [20:38:33] !log upgrading new labsdbs to mariadb 10.1.19 [20:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:55] jynus: \o/ [20:40:30] RECOVERY - Host lvs1008 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:40:40] RECOVERY - Host lvs1009 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [20:41:25] your puppetlint fix failed puppetlint? :) [20:43:14] heh. yea. 2>&1:fatal: ambiguous argument 'HEAD^': unknown revision or path not in the working tree. 
[20:44:29] !log T133395: restbase2001-b.codfw.wmnet: Performing user-defined compaction of la-169239-big-Data.db and la-172629-big-Data.db [20:44:30] RECOVERY - configured eth on lvs1007 is OK: OK - interfaces up [20:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:36] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [20:44:40] RECOVERY - Host lvs1007 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [20:45:30] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy [20:53:37] I am checking labsdb1009 bios- the boot order is right, but it keeps trying to boot from network [20:55:51] 06Operations, 06Performance-Team, 10Thumbor: Avoid thumbor generating log files > 1GB - https://phabricator.wikimedia.org/T150208#2777923 (10Gilles) [21:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, Amir1, and yurik: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161107T2100). Please do the needful. [21:00:04] Dereckson: Respected human, time to deploy Add wikis (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161107T2100). Please do the needful. [21:00:29] I have a rather fast deployment of ORES [21:01:40] Okay, no objections? [21:03:37] no objection from me, since I'm still prepping deployment for mobileapps [21:04:30] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [21:04:40] oh no, google is dead! :) [21:04:48] :P [21:04:49] ohohhh [21:04:50] RECOVERY - Host google is UP: PING OK - Packet loss = 0%, RTA = 22.23 ms [21:06:10] racadm serveraction google up [21:07:14] (03PS1) 10Gilles: Rotate Thumbor 404 log by size, not date [puppet] - 10https://gerrit.wikimedia.org/r/320273 (https://phabricator.wikimedia.org/T150208) [21:08:43] (03CR) 10Chad: [C: 04-1] "Yes, this was already done. 
If we need extra options for the JVM, they should be added to the array in jetty.pp, not appended here." [puppet] - 10https://gerrit.wikimedia.org/r/316622 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [21:08:53] !log deploying c61b9c1 from ORES into canary nodes (T149730) [21:09:08] (03CR) 10Chad: "This is actually probably fine, but will require a gerrit restart." [puppet] - 10https://gerrit.wikimedia.org/r/317322 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [21:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:13] T149730: Deploy logging changes to ORES - https://phabricator.wikimedia.org/T149730 [21:12:26] !log deploying c61b9c1 from ORES to all nodes (T149730) [21:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:04] 06Operations, 10ops-eqiad, 10DBA: labsdb1009 boot issues (power supply and controller?) - https://phabricator.wikimedia.org/T150211#2778014 (10jcrespo) [21:18:04] !log starting Parsoid deploy [21:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:09] (03PS1) 10Madhuvishy: labstore: Add minute to crontab for backups to secondary DC [puppet] - 10https://gerrit.wikimedia.org/r/320276 [21:20:42] (03CR) 10Madhuvishy: [C: 032] labstore: Add minute to crontab for backups to secondary DC [puppet] - 10https://gerrit.wikimedia.org/r/320276 (owner: 10Madhuvishy) [21:21:10] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 659 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3079377 keys, up 7 days 12 hours - replication_delay is 659 [21:22:10] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3077976 keys, up 7 days 13 hours - replication_delay is 0 [21:29:03] (03PS2) 10Dereckson: Initial configuration for ec.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314471 
(https://phabricator.wikimedia.org/T135521) [21:29:24] !log updated Parsoid to version 2c2fe425 [21:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:12] !log starting mobileapps deploy [21:30:15] (03PS3) 10Dereckson: Initial configuration for ec.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314471 (https://phabricator.wikimedia.org/T135521) [21:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:45] (03CR) 10Dereckson: "PS2: rebased (solved merge conflict against ptwikimedia logo). PS3: +wikiversions.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314471 (https://phabricator.wikimedia.org/T135521) (owner: 10Dereckson) [21:32:17] okay, the ores deployment is done [21:35:11] Amir1: have you logged it in SAL? [21:35:26] bearND: Yup at the top [21:35:37] I log the start not the finish, I hope that's okay [21:36:12] Amir1: People usually log the start and the end [21:36:34] okay, I do it [21:36:38] to better correlate with any potential icinga alerts, i think [21:37:07] !log ores deployment c61b9c1 is done [21:37:28] bearND: Sure, thanks [21:37:33] 👍 [21:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:22] !log deployed mobileapps 4202cbb [21:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:30] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [21:50:42] !log deploying latest wdqs gui and blazegraph [21:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:28] SMalyshev: wdqs deployment completed, tests are green... [21:56:34] (03PS1) 10Dereckson: Initial configuration for projectcom.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320298 (https://phabricator.wikimedia.org/T143138) [21:57:35] You're all done for services deployment? 
[21:58:33] (03PS2) 10Dzahn: remove gallium from site.pp, installserver [puppet] - 10https://gerrit.wikimedia.org/r/318216 (https://phabricator.wikimedia.org/T95757) [21:59:50] (03CR) 10Dereckson: [C: 032] Initial configuration for ec.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314471 (https://phabricator.wikimedia.org/T135521) (owner: 10Dereckson) [22:00:04] dapatrick, bawolff, and Reedy: Respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161107T2200). Please do the needful. [22:00:28] (03Merged) 10jenkins-bot: Initial configuration for ec.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314471 (https://phabricator.wikimedia.org/T135521) (owner: 10Dereckson) [22:03:40] !log Starting ec.wikimedia.org wiki creation [22:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:54] Dereckson: Let me know when you're done... Couple of patches to deploy for security [22:08:05] * Dereckson nods [22:08:23] (03CR) 10Dzahn: [C: 032] remove gallium from site.pp, installserver [puppet] - 10https://gerrit.wikimedia.org/r/318216 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [22:09:41] make that 3 :) [22:10:28] !log gallium - revoke puppet cert, deactivate node [22:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:09] almost deleted all salt keys, thanks salt-key for asking [N/y] [22:12:20] lol [22:13:26] -d and -D [22:14:30] Dereckson, any errors from the script? 
[22:14:37] Krenair: yes, but solved at pass 2 [22:14:50] was Cirrus this time, at populate step [22:15:11] !log gallium - delete salt key, minion is stopped [22:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:52] so database okay, let's sync the config [22:17:32] !log dereckson@tin Synchronized dblists: (no message) (duration: 00m 53s) [22:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:40] PROBLEM - salt-minion processes on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [22:17:59] ^ yea, that should have been gone after the "node deactivate" step above [22:18:01] !log dereckson@tin rebuilt wikiversions.php and synchronized wikiversions files: (no message) [22:18:03] hrmm [22:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:34] !log gallium - stopped apache, stopped salt, removed zuul cronjob [22:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:10] wow there goes gallium, yay! 
[22:19:24] mutante don't touch my salt stash :P [22:19:45] that's actually _my_ salt stash, if you wanna get picky about it :-P [22:20:01] heheh [22:20:16] anyways, a long time in coming, great to see it finishing up [22:20:34] :) [22:20:39] apergos i would reply in a funny manner but unfortunately there's a shit ton going on so i don't think that would make some people happy with me [22:20:48] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: ec.wikimedia initial configuration (T135521) (duration: 00m 47s) [22:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:53] yep let's let em work [22:20:54] T135521: Internal Wiki for Wikimedians of Ecuador - https://phabricator.wikimedia.org/T135521 [22:21:09] (03CR) 10Krinkle: Rotate Thumbor 404 log by size, not date (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/320273 (https://phabricator.wikimedia.org/T150208) (owner: 10Gilles) [22:21:40] !icinga tell einsteinium and tegmen to remove gallium [22:22:24] !log dereckson@tin Synchronized static/images/project-logos/: Logos for ec.wikimedia (T135521) (duration: 00m 48s) [22:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:26] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 07Wikimedia-Incident: Labstore2001 controller or shelf failure - https://phabricator.wikimedia.org/T102626#2778227 (10madhuvishy) Labstore2001 is up and running with 12 internal disks connected via H700 raid controller, and 48 external disks across... [22:25:44] !log Created tables for OATHAuth on ec.wikimedia [22:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:07] Dereckson what is this "ec.wikimedia"? 
[22:26:18] Wikimedistas de Ecuador [22:27:30] Reedy: I've run addwiki (created the tables, sent the mail to the mailing list), sync'ed dblist, wikiversions, wmf-config/InitialiseSettings.php, but wiki doesn't appear at https://ec.wikimedia.org/ [22:28:19] What are the conditions to serve a wiki there? [22:28:44] ecwikimedia exists, check / wikiversion check [22:29:42] mwrepl ecwikimedia → echo $wgServerName gives me ec.wikimedia.org as well [22:30:51] ah [22:30:52] For *wikimedia databases, add the subdomain to the list in MWMultiVersion::setSiteInfoForWiki [22:31:07] Dereckson are DNS and stuff set up? [22:31:11] Zppix: yup [22:31:22] Zppix, please leave Dereckson alone, he has important things to do [22:31:34] Krenair I'm trying to help him out with why it won't show [22:31:44] Zppix: it's for Ecuador user group and yep, i added to DNS [22:32:43] You wouldn't get "no wiki found" with no dblist [22:32:46] *DNS [22:33:02] (03PS1) 10Dereckson: Add ec.wikimedia to MWMultiVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320302 (https://phabricator.wikimedia.org/T135521) [22:33:24] (03CR) 10Reedy: [C: 032] Add ec.wikimedia to MWMultiVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320302 (https://phabricator.wikimedia.org/T135521) (owner: 10Dereckson) [22:33:25] Reedy, wow how'd i miss that part of the message [22:33:53] "Please specify a valid Host header." 
[22:34:01] (03Merged) 10jenkins-bot: Add ec.wikimedia to MWMultiVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320302 (https://phabricator.wikimedia.org/T135521) (owner: 10Dereckson) [22:34:23] Works on mw1099 [22:34:59] sync it out then :) [22:35:32] https://ec.wikimedia.org [22:35:32] * 200 OK 10518 [22:35:44] !log dereckson@tin Synchronized multiversion/MWMultiVersion.php: Add ec.wikimedia to MWMultiVersion (T135521) (duration: 00m 49s) [22:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:50] T135521: Internal Wiki for Wikimedians of Ecuador - https://phabricator.wikimedia.org/T135521 [22:35:59] Magic [22:36:11] it's there :) nice [22:36:25] mutante: yes, for *.wikimedia.org, there is a need to declare the code in the script calling MediaWiki [22:36:39] ah login required. I see. working at any rate [22:36:58] So, stewards can take care to fix login issues for private wikis or is something required there? [22:37:51] https://ec.wikimedia.org/ doesn't seem to be redirecting to https://ec.wikimedia.org/wiki/P%C3%A1gina_principal [22:38:01] But that's minor [22:38:19] Dereckson: And no, for a private wiki, stewards can't help [22:38:20] it sends you to the special:login page [22:38:30] and that's expected, Reedy [22:38:49] If you visit https://office.wikimedia.org [22:39:07] it redirects you to https://office.wikimedia.org/wiki/Main_Page [22:39:10] !log Un nuevo wiki ha nacido. Bienvenido grupo de usuarios Ecuador Wikimedia. https://ec.wikimedia.org (T135521) [22:39:14] But I guess that might've been whitelisted [22:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:21] * mutante "tweets" [22:39:22] sorry Dereckson can't help [22:39:31] I am guessing so [22:39:33] Dereckson: Need to find out who needs an account.. 
And create them one as 'crat [22:39:48] using createAndPromote.php [22:39:56] But that's minor and can be done by any shell user [22:40:30] Reedy: right, I'm done in this case, you can deploy the security fixes. I'll do interwiki and cleanup afterwards. [22:40:46] (03PS2) 10Reedy: Don't override message key in badpass log entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319134 (owner: 10Brian Wolff) [22:40:51] (03CR) 10Reedy: [C: 032] Don't override message key in badpass log entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319134 (owner: 10Brian Wolff) [22:40:54] All stewards can do with private wikis is grant/remove groups from their users [22:41:25] only if it is part of the main cluster [22:41:28] (03Merged) 10jenkins-bot: Don't override message key in badpass log entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319134 (owner: 10Brian Wolff) [22:41:33] ACKNOWLEDGEMENT - salt-minion processes on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion daniel_zahn decom T95757 [22:42:43] (03PS2) 10Reedy: Set $wgOATHAuthAccountPrefix to 'Wikimedia' for WMF CA wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320266 [22:42:47] (03CR) 10Reedy: [C: 032] Set $wgOATHAuthAccountPrefix to 'Wikimedia' for WMF CA wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320266 (owner: 10Reedy) [22:43:27] (03Merged) 10jenkins-bot: Set $wgOATHAuthAccountPrefix to 'Wikimedia' for WMF CA wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320266 (owner: 10Reedy) [22:44:47] Reedy: checked a bunch of other private wikis, they all have main page visible, so dunno. 
[22:44:58] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Set wgOATHAuthAccountPrefix and Don't override message key in badpass log entries (duration: 00m 47s) [22:45:02] It'll be something or nothing [22:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:22] !log reedy@tin Synchronized php-1.29.0-wmf.1/includes/specials/: Deploy security fix T150044 (duration: 00m 54s) [22:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:54] Dereckson: I think you can continue now [22:53:15] * Dereckson nods [22:53:23] Thanks [22:53:59] looks like all private wikis should have main page whitelisted so I'm outa ideas [22:54:05] 'private' => [ 'Main Page', 'Special:UserLogin', 'Special:UserLogout' ], [22:54:16] Page doesn't exist or similar maybe? [22:54:25] mm I thought by default there was [22:54:31] It should be [22:56:24] they can create a redirect from [[Main Page]] to Pagina principal in this case [22:56:30] and that will solve the issue. [22:56:40] PROBLEM - Apache HTTP on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.004 second response time [22:56:47] Dereckson: Oh, no [22:56:57] Actually, yes [22:56:59] Something like that [22:57:40] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.019 second response time [22:58:14] Or change MediaWiki:mainpage to Main_Page [22:59:21] there is a Main_Page there [22:59:25] say the dumps :-P [22:59:50] maybe it's a matter of the content language, like you say [22:59:59] MW doing something bizarre? [23:00:04] Well I never [23:00:06] * Reedy grins [23:00:15] heh [23:00:20] en/es issue? 
[23:00:26] prolly [23:00:36] been way too long since I've been involved with one of these though [23:00:43] adding a wiki, that is [23:01:25] https://el.wiktionary.org/w/index.php?title=Main_Page&redirect=no [23:01:28] yep redirect [23:03:06] mediawiki could be smart enough to check content language and make the Main_Page redirect to the placeholder content inserted at $translated_title if content_language is not en... [23:05:42] apergos i agree [23:06:32] that btw is probably the fastest the first dump of a wiki has ever been delivered :-P [23:06:43] all right, I'm out.. see folks on US election day :-P [23:07:49] nice apergos :) [23:12:39] mutante, i am going to try and upload a new parsoid deb (0.6.0) with the latest master .. do you know if it will delete 0.5.3 or retain it? Last time, the 0.5* and 0.4.1 were left behind. [23:13:09] (03PS1) 10Dereckson: Update interwiki map for vote. and ec.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320308 [23:13:30] (03CR) 10Dereckson: [C: 032] Update interwiki map for vote. and ec.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320308 (owner: 10Dereckson) [23:14:01] (03Merged) 10jenkins-bot: Update interwiki map for vote. and ec.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320308 (owner: 10Dereckson) [23:14:13] subbu: i think it will retain it but users installing will all get the latest version [23:14:29] that is fine. [23:14:48] just in case someone wants just a security update and not a whole update .. good to have the 0.5.3 around for a bit. [23:14:52] thanks. will upload now. [23:15:05] !log mwscript --deleteEqualMessages.php --wiki gawiktionary (T45917) [23:15:08] New interwiki map works on mw1099, syncing. 
[23:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:10] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [23:16:20] mutante, it removed 0.5.3 :( [23:16:27] !log dereckson@tin Synchronized wmf-config/interwiki.php: Update interwiki map for vote. and ec.wikimedia ([[Gerrit:320308]]) (duration: 00m 47s) [23:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:49] i guess we'll have to figure out where to host our old deb releases .. since the next release after 0.6.0 will be a big breaking release. [23:19:29] subbu: hrmm.. but like you said, in the past the old versions stuck around and they were only deleted with a manual command .. maybe it has to be a new distribution [23:19:30] !log Created storage container for ec.wikimedia (private) [23:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:53] subbu: maybe "jessie-mediawiki-old" is needed [23:20:23] mutante, hmm ... looks like varnish might need a purge again .. 
i am still getting a 0.5.3 when i do a sudo apt-get install parsoid [23:22:41] !log ec.wikimedia.org wiki creation done [23:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:53] !log Starting projectcom.wikimedia.org wiki creation [23:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:04] (03PS2) 10Dereckson: Initial configuration for projectcom.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320298 (https://phabricator.wikimedia.org/T143138) [23:25:20] PROBLEM - HHVM rendering on mw1207 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [23:25:21] subbu: i'll try to find out the commands that godog ran [23:25:27] k [23:26:20] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 76130 bytes in 0.349 second response time [23:27:07] (03CR) 10Dereckson: [C: 032] Initial configuration for projectcom.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320298 (https://phabricator.wikimedia.org/T143138) (owner: 10Dereckson) [23:27:41] (03Merged) 10jenkins-bot: Initial configuration for projectcom.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320298 (https://phabricator.wikimedia.org/T143138) (owner: 10Dereckson) [23:30:22] mutante: they'd be on neodymium for the purges, odd though I don't remember having this problem [23:32:47] !log mwscript --deleteEqualMessages.php --wiki jawikibooks (T45917) [23:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:52] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [23:34:29] Krenair: we've got a winner! addWiki completed with success at first run. 
[23:34:42] Dereckson, congratulations [23:35:07] I'm not going to count this one because it's apparently a double wiki creation day :p [23:35:18] !log projectcomwiki database created [23:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:36] you fixed the errors that showed up on the first creation? [23:36:10] No, I'll file a bug later about that one [23:36:27] (03PS1) 10BBlack: VCL: retry explicit 503 once as well [puppet] - 10https://gerrit.wikimedia.org/r/320310 [23:37:38] Dereckson, so how did you get it to work the second time round? [23:37:57] ema: merging https://gerrit.wikimedia.org/r/#/c/320310/ because it seems pretty legit, and likely will reduce the minor 503 spikes we see on e.g. upload be restarts, etc [23:38:01] oops [23:38:42] (03CR) 10BBlack: [C: 032] VCL: retry explicit 503 once as well [puppet] - 10https://gerrit.wikimedia.org/r/320310 (owner: 10BBlack) [23:39:04] Krenair: running it again, commenting out the steps already done, and that time Elasticsearch index completed [23:41:17] !log dereckson@tin Synchronized dblists/: Added projectcomwiki (duration: 00m 48s) [23:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:07] !log dereckson@tin rebuilt wikiversions.php and synchronized wikiversions files: Added projectcomwiki [23:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:31] https://projectcom.wikimedia.org/wiki/Main_Page is live [23:43:36] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Initial configuration for projectcom.wikimedia.org (duration: 00m 53s) [23:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:57] !log Created storage container for projectcomwiki (private) [23:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:39] !log mwscript --deleteEqualMessages.php --wiki jawikinews (T45917) [23:50:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:44] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [23:54:30] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:55:02] !log delete parsoid from releases.wikimedia.org and varnish-ban on cache_misc [23:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:16] !log Created 'Mjohnson (WMF)' user account on projectcom.wikimedia.org as bureaucrat [23:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:55] Hi!